library(tidyverse)
library(janitor)
library(readxl)
library(skimr)
library(emmeans)
library(survival)
library(coin)
library(stats)
library(rstatix)
library(ggpubr)
library(patchwork)
library(grid)

Analysis of Verbal Communication Differences between Autistic and Non-autistic Adults
1 Project Outcomes
By completing this project, I gained experience in:
- Cleaning and analysing real-world data
- Applying statistical tests
- Visualising research findings
- Using R, the Tidyverse package, and other supporting packages
2 Introduction
For my dissertation, I analysed communication differences between autistic and non-autistic adults. My aim was to investigate how verbal communication is linked to rapport-building for each neurotype. Specifically, I focused on verbal engagement and verbal backchanneling, exploring how these communicative behaviours varied across different neurotype pairs and their connection with rapport-building.
As part of this research, I extracted data from a series of pre-recorded interactions. Each interaction was centred around learning a new skill, and involved two people, either with matched or mixed neurotypes (see Fig. 1A). The data I extracted included the total length of each interaction, as well as the verbal engagement and backchanneling of the participant. Additionally, I used the rapport scores given by each participant’s interaction partner at the end of the interaction (see Fig. 1B).
Figure 1. Summary of the research design (A and B). Panel A shows the grouping of participants into pairs, either with matched neurotypes (autistic-autistic or non-autistic–non-autistic) or mixed neurotypes (autistic–non-autistic). In the figure, autistic individuals are represented in plum, while non-autistic individuals are shown in grey. Panel B depicts the data collected. In each pair, I collected verbal communication (backchanneling and engagement) data from the participant and rapport scores from their interaction partner.
By analysing this data, I aimed to answer the following questions:
Do people interact for longer, shorter, or the same amount of time when paired with someone of the same neurotype?
Do people experience a stronger sense of rapport when paired with someone of the same neurotype?
Are people more verbally engaged when interacting with someone of the same neurotype? Is this pattern consistent for both autistic and non-autistic individuals?
Does verbal backchanneling occur more frequently when both individuals are of the same neurotype? Is this pattern consistent for both autistic and non-autistic individuals?
Does higher verbal engagement and/or backchanneling contribute to building rapport? And is this relationship consistent across different neurotype pairs?
Based on previous research, I expected interaction length to be roughly similar across matched and mixed pairs, but rapport scores to be higher in matched pairs. I also anticipated that both autistic and non-autistic pairs would spend a greater proportion of time verbally engaging with their interaction partner than those in mixed pairs. Additionally, I predicted that higher verbal engagement would be linked to higher rapport scores within matched pairs. In contrast, I expected only non-autistic pairs to produce backchannels more frequently than mixed or autistic pairs, with a higher rate of backchanneling contributing to stronger rapport specifically in those non-autistic pairs (see Table 1). For the (admittedly debatable) explanation behind these predictions, see my Honours Project (Section Introduction: Aims and Hypotheses).
Table 1. Summary of expected research outcomes.
| Variable | Autistic Pairs | Non-autistic Pairs | Mixed Pairs |
|---|---|---|---|
| Interaction Duration | |||
| Rapport | |||
| Verbal Engagement | |||
| Engagement and Rapport Relationship | |||
| Verbal Backchanneling | |||
| Backchanneling and Rapport Relationship |
In this project, I work with the data to test if these predictions hold. I move through key stages of data analytics — starting with processing my raw data, then analysing the cleaned datasets, and finally sharing my findings.
Below is the R code I use at each stage of the project, along with detailed commentary to demonstrate my data handling skills in R.
2.1 Libraries
Before I begin, I load all the supporting packages I use during this project; these are the library() calls shown at the very top of this document.
3 Processing Data
Once the research objectives are in place and the necessary data prepared, it is time to process the data. The process phase is essentially when you clean and organise your data. This involves tasks like fixing errors and inconsistencies, transforming data into a more usable format, and combining multiple datasets. In short — making sure the data is ready before diving into analysis.
And that’s exactly what I’ll be doing at this stage of the project. I’ll start by importing my data, then introduce the validation criteria I use to check data integrity and guide me through the cleaning process. I’ll clean the data, aggregate it, transform it into wide and long format tables, and finally combine datasets so I’ve got everything I need for the analysis stage.
Without further ado, let’s get started!
3.1 Data Import
The first challenge I encounter is that my verbal engagement and backchanneling data are split across 59 separate .txt files, each corresponding to a single participant. I need to combine these files into one dataset while ensuring that each entry can still be traced back to the original participant for later analysis. This process proves to be a bit tricky and takes some time to figure out. Ultimately, I settle on writing a loop:
# Before creating a function, I make some preparations.
# First, I point R to the folder where my data are stored.
# In my case, I have a folder called "raw_data" in the working directory:
data_dir <- "raw_data"
# Next, I list all the .txt files in that folder.
# Each .txt file corresponds to a single participant and contains data on their
# verbal communication:
data_files <- list.files(data_dir, pattern = "\\.txt$", full.names = TRUE)
# All .txt files follow the same structure and share the same column names:
column_names <- c("measurement", "programme_version", "start", "finish", "length", "type")
# Finally, I create an empty list to temporarily store each participant's data:
participant_data <- list()
# Now that I have everything ready, I write a loop that goes through each file,
# applies column names, adds the participant ID, and imports the data into the list:
for (file in data_files) {
# Read the data and apply column names:
data <- read.table(file, header = FALSE, sep = "\t", col.names = column_names)
# Extract the participant ID by removing the '.txt' extension from the file name:
participant_id <- gsub("\\.txt$", "", basename(file))
# Add a new column to track the participant ID:
data$participant_id <- participant_id
# Store the cleaned data in the list:
participant_data[[participant_id]] <- data
}
# Then, I combine all individual data frames into one large data frame:
communication <- do.call(rbind, participant_data)
# and reset row names for clarity:
rownames(communication) <- NULL
# We’ve got our data imported. Perfect!
# Before finishing up, I remove the intermediate variables to keep things neat and nice:
rm(list = c("data", "participant_data", "column_names", "data_dir", "data_files",
"file", "participant_id"))

In addition, I upload another dataset that contains participants’ rapport scores. This turns out to be much more straightforward since the data is stored in a single Excel table and can be easily uploaded using the read_excel() function:
rapport <- read_excel("raw_data/demo_rapport.xlsx") |>
# clean_names(): makes sure the variable names are clean and
# follow a consistent format:
  clean_names()

I now have two datasets to work with: communication and rapport. I need to check each one to ensure that everything is in order and clean up anything that needs fixing. This process — called data validation and cleaning — makes sure the data is reliable and ready for analysis.
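As a side note, the file-import loop above can also be written in a more compact, tidyverse-functional style. The sketch below is only an illustration of that alternative (it assumes the same raw_data folder and tab-separated file structure), not the code I actually used:

```r
library(tidyverse)

data_files <- list.files("raw_data", pattern = "\\.txt$", full.names = TRUE)
column_names <- c("measurement", "programme_version", "start", "finish", "length", "type")

communication_alt <- data_files |>
  # Name each file path by its participant ID (file name without the extension):
  set_names(\(f) str_remove(basename(f), "\\.txt$")) |>
  # Read every file as a tab-separated table with the shared column names:
  map(\(f) read_tsv(f, col_names = column_names, show_col_types = FALSE)) |>
  # Bind all tables together, storing the list names in `participant_id`:
  list_rbind(names_to = "participant_id")
```

Here, list_rbind() with names_to replaces both the do.call(rbind, ...) step and the manual participant_id column in one go.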
3.2 Validation Checklist
Data validation is the process of ensuring the integrity of the data: making sure it’s trustworthy and reliable enough to base your analysis on. To check for data integrity, I have compiled the following checklist:
- Relevance: the data falls within the acceptable range of values and is relevant to the analysis.
- Accuracy: there are no duplicates or misspelled values.
- Completeness: there are no missing values.
- Consistency: related fields agree with one another.
- Specific structure: the data follows the expected format, where one applies.
- Data type: each variable is stored as the appropriate type in R.
I’ll explore each dataset by focusing on its variables and checking whether the data meets these criteria. If it doesn’t, I’ll clean it up.
I’ll start with the communication dataset.
3.3 Dataset communication Clean-Up
First, I take a quick look at the overall structure of the dataset using str() function:
str(communication)

'data.frame':	1754 obs. of  7 variables:
$ measurement : chr "Interaction" "Engagement" "Non-verbal" "Non-verbal" ...
$ programme_version: int 10 10 10 10 10 10 10 10 10 10 ...
$ start : num 40 40 44.8 47.9 72 ...
$ finish : num 136.6 136.6 45.8 48.9 73 ...
$ length : num 96.6 96.6 1 1 1 ...
$ type : chr "" "Low" "" "" ...
$ participant_id : chr "P10(1)" "P10(1)" "P10(1)" "P10(1)" ...
The dataset contains 7 variables and 1,754 observations. To ensure that this is accurate, I check that there are no duplicate rows in the dataset using distinct() function:
str(distinct(communication))

'data.frame':	1754 obs. of  7 variables:
$ measurement : chr "Interaction" "Engagement" "Non-verbal" "Non-verbal" ...
$ programme_version: int 10 10 10 10 10 10 10 10 10 10 ...
$ start : num 40 40 44.8 47.9 72 ...
$ finish : num 136.6 136.6 45.8 48.9 73 ...
$ length : num 96.6 96.6 1 1 1 ...
$ type : chr "" "Low" "" "" ...
$ participant_id : chr "P10(1)" "P10(1)" "P10(1)" "P10(1)" ...
# The number of observations is the same. No duplicates.

There are no duplicates. The 7 variables in this dataset are:
- measurement,
- programme version,
- measurement’s start time,
- measurement’s finish time,
- measurement’s length,
- measurement’s type, and
- participant’s ID.
I will now go through each variable step by step and clean the data for each field where appropriate.
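Before going through the variables one by one, it can be useful to profile the whole dataset in a single pass. The skimr package (loaded at the start) provides this; the snippet below is an optional quick check rather than a required step:

```r
library(skimr)

# Per-variable summary: type, missing values, unique values, and
# distribution statistics for the numeric fields:
skim(communication)

# The same profile restricted to the timing variables only:
communication |>
  select(start, finish, length) |>
  skim()
```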
3.3.1 Variable measurement
Data in the measurement variable should only match the categories from the table below (Table 2).
Table 2. Definitions of measurement categories.
| Measurement | Definition |
|---|---|
| Interaction | Shows the duration of the interaction between two participants in each pair. For further details about the task and roles, see my Honours Project (Section Methods: Dyadic interactions). In most cases, there was one interaction measurement per pair. However, if the interaction was interrupted (e.g., the researcher entered the room), multiple measurements were recorded (e.g., one before the interruption and another after) (see Fig. 3). |
| Engagement | Each interaction is divided into periods of High, Low and Swap engagement based on the participant’s verbal activity. For the definition of each engagement level, see Table 4. In most cases, High, Low and Swap engagement periods alternated, resulting in multiple engagement measurements. The sum of all engagement measurements should equal the total length of the interaction (see Fig. 4). |
| Backchannel Responses | Shows an instance of the participant’s backchannel response (i.e., Backchanneling). Backchannel responses are further split into the following categories: Laughs, Non-lexical speech, Lexical speech and Short phrases (see Fig. 5). |

Figure 3. Interaction Measurement. The figure illustrates an example of an interaction that was interrupted by the researcher and then resumed.
Figure 4. Engagement Measurement. Each interaction was broken down into engagement periods (EnP), with each period classified as High, Low, or Swap. The figure illustrates an example of an interaction divided into four engagement periods.
Figure 5. Backchanneling Measurement. All backchannel responses were recorded and categorised into one of four groups. The figure illustrates an example of an interaction with five backchannel responses: three belonged to the Lexical Speech group (dark purple), one to the Non-Lexical Speech group (bright green), and one to the Laughs group (grey).
To start, I check if the data in this field matches the categories above using the unique() function.
# To check that, I list all unique values in the `measurement` field:
unique(communication$measurement)

[1] "Interaction"           "Engagement"            "Non-verbal"
[4] "Non-verbal Engagement" "Laughs"                "Non-lexical speech"
[7] "Lexical speech"        "Short phrases"
All values fall within the acceptable range, with the exception of “Non-verbal” and “Non-verbal Engagement”. Since my focus is on verbal communication, these values are irrelevant to my analysis and can be removed. Additionally, I need to indicate that all individual categories of backchannel responses belong to one measurement group - Backchanneling.
To address these points, I will:
- Filter out all “Non-verbal Engagement” and “Non-verbal” values using the filter() function.
- Group all backchannel responses into a single category, “Backchanneling”, using the case_when() function.
In the process, I create a new dataset, communication_clean, to store the cleaned data.
# Create a new dataset to store cleaned data:
communication_clean <- communication |>
# Remove "Non-verbal Engagement" and "Non-verbal" backchannel responses:
filter(measurement != "Non-verbal Engagement" & measurement != "Non-verbal") |>
# Rename all backchannel responses into a single category: "Backchanneling":
mutate(measurement =
case_when(
measurement == "Laughs" ~ "Backchanneling (Laughs)",
measurement == "Lexical speech" ~ "Backchanneling (Lexical Speech)",
measurement == "Non-lexical speech" ~ "Backchanneling (Non-lexical Speech)",
measurement == "Short phrases" ~ "Backchanneling (Short Phrases)",
TRUE ~ measurement)
)
# Did it work?
unique(communication_clean$measurement)

[1] "Interaction"                         "Engagement"
[3] "Backchanneling (Laughs)"             "Backchanneling (Non-lexical Speech)"
[5] "Backchanneling (Lexical Speech)"     "Backchanneling (Short Phrases)"

# It did!

Now, all data in this field falls within the acceptable range and is relevant. The data is accurate (no duplicate values due to misspellings) and complete (there are no explicit missing values). [Note: I will check for implicit blank values later in Data Aggregation.]
Almost all validation criteria for clean data have been addressed, except for two: specific structure and data type.
The data in this field doesn’t follow a specific structure, so the specific structure criterion doesn’t apply here.
The data should be qualitative and nominal, meaning it falls into a small set of distinct categories without any inherent order. In R, this type of data is best represented as a factor.
# Check the data type:
str(communication_clean$measurement)

 chr [1:1738] "Interaction" "Engagement" "Interaction" ...
Currently, the data is stored as a character (also referred to as a string). To align it with the proper data type, I transform it into a factor using mutate() and as.factor() functions:
communication_clean <- communication_clean |>
# Convert `measurement` to a factor:
mutate(measurement = as.factor(measurement))
# Was the transformation successful?
str(communication_clean$measurement)

 Factor w/ 6 levels "Backchanneling (Laughs)",..: 6 5 6 1 1 3 3 3 3 2 ...

levels(communication_clean$measurement)

[1] "Backchanneling (Laughs)"             "Backchanneling (Lexical Speech)"
[3] "Backchanneling (Non-lexical Speech)" "Backchanneling (Short Phrases)"
[5] "Engagement"                          "Interaction"

# Yes, it was!

The data now matches the data type defined for the field. With this final step, all validation criteria for the variable measurement have been checked, and the data is fully cleaned. I can move on to the next variable.
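One caveat worth noting: as.factor() orders levels alphabetically, which is why “Backchanneling (Laughs)” appears first above. If a different level order is ever needed (e.g., for plotting), forcats, which ships with the tidyverse, can rearrange levels without changing the underlying data. A hypothetical sketch:

```r
library(forcats)

# Move the two broad measurement types to the front, leaving the
# backchanneling subcategories in their alphabetical order afterwards:
communication_clean <- communication_clean |>
  mutate(measurement = fct_relevel(measurement, "Interaction", "Engagement"))

# The first two levels are now "Interaction" and "Engagement":
levels(communication_clean$measurement)
```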
3.3.2 Variable programme_version
The programme version remained the same throughout the data preparation stage and doesn’t contribute any meaningful information to the analysis. Since it’s irrelevant to my goals, I remove it from the dataset using the select() function.
communication_clean <- communication_clean |>
  select(-programme_version)

3.3.3 Variables start, finish and length
The start, finish, and length variables are interconnected and best reviewed together.
- start: Indicates when the measurement started.
- finish: Indicates when the measurement finished.
- length: Indicates the duration of the measurement (calculated as the difference between start and finish times).
I only need the length variable for my analysis, so the start and finish variables could be removed. However, I’m not going to do that just yet, since these variables can be used in the data validation process.
I begin by checking if the data matches the expected data type. Since these variables represent quantitative, continuous measurements, they should be stored as numeric values in R.
# Check the data type for each variable:
communication_clean |>
select(start, finish, length) |>
  str()

'data.frame':	1738 obs. of  3 variables:
$ start : num 40 40 76.2 81.3 429.9 ...
$ finish: num 136.6 136.6 447.8 82.3 430.9 ...
$ length: num 96.6 96.6 371.6 1 1 ...
# All variables match the expected data type!

The data matches the expected type. Next, I verify that there are no missing values and that all values fall within the acceptable range and are non-negative. I do this using the filter(), if_any() and lambda functions:
communication_clean |>
filter(
    # In R, missing numeric values are represented as `NA`.
# Check for missing values (NA) across the fields:
if_any(start:length, is.na) |
# Ensure no values are negative using lambda function:
if_any(start:length, \(x) x < 0)
  )

[1] measurement    start          finish         length         type
[6] participant_id
<0 rows> (or 0-length row.names)
There are no missing or negative values, confirming that the data is complete and falls within the acceptable range. The data does not follow any specific structure, so this criterion does not apply here.
Finally, I check the data for accuracy and consistency across the fields. To meet these criteria, the following conditions must hold:
- The length values must equal the difference between finish and start values.
- For each participant, the sum of all interaction length values must match the sum of all engagement length values (see Table 2).
- All length values for backchanneling should be equal to 1 (see Table 2).
Let’s first check if the length values were calculated correctly and equal the difference between the finish and start values. Here’s how I go about it:
# First, I create a new temporary dataset `finish_start_valid` to store
# variables used for validation.
finish_start_valid <- communication_clean |>
# Add a new column `diff` containing the difference between `finish` and `start`:
mutate(diff = finish - start,
# Round `diff` to 3 decimal places:
diff = round(diff, 3),
# Create a new column comparing `diff` with `length`, returning
# TRUE if they match or FALSE if they don't.
# Variable `length` is also rounded to 3 decimals for consistency.
diff_matches_length = diff == round(length, 3)
)
summary(finish_start_valid$diff_matches_length)

   Mode   FALSE    TRUE
logical       2    1736
# This shows two FALSE values, meaning there are two mismatched values.
# Let's check which they are:
finish_start_valid |>
  filter(diff_matches_length == FALSE)

  measurement start  finish  length type participant_id    diff
1 Interaction 2.589 249.263 223.500          P61(1)     246.674
2  Engagement 4.100  92.384  68.246  Low    P68(1)      88.284
  diff_matches_length
1               FALSE
2               FALSE
We can see that the length values are smaller than they should be for Interaction measurement in participant 61 (223.500 vs 246.674) and Engagement measurement in participant 68 (68.246 vs 88.284). There are two possible explanations for this discrepancy: either the length values were calculated incorrectly, or the start and finish values are inaccurate.
To figure out which explanation is more likely, let’s take a closer look at each participant’s data.
As a reminder, the total length of all engagement periods should add up to the total length of the interaction. If the start and finish values are wrong but the length values are correct, this relationship should still hold. But if the length values were miscalculated, the total engagement and interaction lengths won’t match — and the difference between them should be the same as the gap between the expected and recorded length values.
Let’s first examine the data for participant 61:
# Participant 61
P61 <- communication_clean |>
# Filter for participant 61 and the "Interaction" and "Engagement" measurements:
filter(participant_id == "P61(1)" & measurement %in%
c("Interaction", "Engagement")) |>
# Group by measurement type and calculate the total length for each:
group_by(measurement) |>
summarise(total_length = sum(length))
# Check the total lengths:
print(P61)

# A tibble: 2 × 2
  measurement total_length
  <fct>              <dbl>
1 Engagement          247.
2 Interaction         224.
# Does the difference between the total lengths of engagement and
# interaction match the difference between
# 223.500 and 246.674?
near((P61[1,2] - P61[2,2]), (246.674 - 223.500))

     total_length
[1,]         TRUE

# It does match!

The total length of all engagement measurements does not equal the total length of the interaction measurement. Instead, the difference between the total lengths matches the discrepancy between the recorded and expected length values (246.674 - 223.500). This suggests that the length value was miscalculated.
Now let’s check if the same pattern holds for participant 68:
# Participant 68
# I perform the same transformations as for participant 61:
P68 <- communication_clean |>
filter(participant_id == "P68(1)" & measurement %in%
c("Interaction", "Engagement")) |>
group_by(measurement) |>
summarise(max_length = sum(length))
print(P68)

# A tibble: 2 × 2
measurement max_length
<fct> <dbl>
1 Engagement 87.8
2 Interaction 108.
# Does the difference between the total lengths of engagement and
# interaction match the difference between 88.284 and 68.246?
near((P68[2,2] - P68[1,2]), (88.284 - 68.246))

     max_length
[1,]       TRUE
Again, we observe the same pattern. Based on these findings, I conclude that the length values were calculated incorrectly for both participants.
Let’s correct these errors by adding a new column with the recalculated lengths using mutate() function and removing the old column using select() function:
communication_clean <- communication_clean |>
# Add a new column with the corrected `length` values.
# [Length is expressed in seconds - hence the symbol _s]
mutate(measurement_length_s = finish - start,
measurement_length_s = round(measurement_length_s, 3)
) |>
# Remove the old column:
select(-length) |>
# Relocate the new column to appear after the 'finish' column:
relocate(measurement_length_s, .after = finish)
# Remove datasets that are no longer necessary:
rm(list = c("finish_start_valid", "P61", "P68"))

Now that the first condition is met, I am going to check the second condition, ensuring that the total interaction length equals the total engagement length for each participant using the group_by() and summarise() functions.
interaction_engagement_valid <- communication_clean |>
# Filter out unnecessary measurement values:
filter(measurement != "Backchanneling") |>
# Group the data by each participant and each measurement type to
# calculate the total duration for each combination of participant and
# measurement:
group_by(participant_id, measurement) |>
# Round total duration values to three decimal places to avoid
# floating-point precision errors.
# Drop the grouping after calculation:
summarise(total_length = round(sum(measurement_length_s),3),
.groups = "drop") |>
# Pivot the table so that interaction and engagement duration
# are in separate columns:
pivot_wider(names_from = measurement, values_from = total_length) |>
# Add a new column to compare length of each measurement for each participant:
mutate(length_valid = Interaction == Engagement)
# Are there any mismatched length values?
interaction_engagement_valid |>
filter(length_valid == FALSE) |>
  nrow()

[1] 44

# 44 participants out of 59 have mismatched length values.
# How large are these differences?
interaction_engagement_valid |>
# Filter only for mismatched values:
filter(length_valid == FALSE) |>
# Calculate how large the differences between mismatched values are.
# Express differences as percentages and ensure they are positive:
mutate(difference = abs((Interaction - Engagement)/Interaction * 100)) |>
# Visualise differences using a histogram:
ggplot(aes(x = difference)) +
  geom_histogram()

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Fortunately, most of the differences are quite small.
# Remove the dataset to keep things tidy:
rm("interaction_engagement_valid")

Out of 59 participants, 44 have mismatched total interaction and engagement lengths. However, the histogram above shows that most differences are minimal (<0.5%), suggesting that there was no significant data loss.
Instead, these discrepancies are likely due to minor inconsistencies in the start and finish times across different fields, leading to slightly different length values. The inconsistencies may arise from one or both of the following scenarios:
1. Within engagement measurements: Overlapping or disconnected start and finish times between individual engagement periods may slightly inflate or reduce the total engagement length.
2. Between interaction and engagement measurements: The start and finish times for the interaction may not align perfectly with the summed engagement start and finish times.
Although these differences appear negligible and are unlikely to impact the analysis significantly, I have decided to address them. I start with the second scenario and check for inconsistencies in the timing between interaction and engagement measurements:
# First, I extract the minimum `start` and maximum `finish` times for
# each participant's interaction.
# I create a new dataset called `interaction_boundaries` to store this data:
interaction_boundaries <- communication_clean |>
# Filter for interaction measurements only:
filter(measurement == "Interaction") |>
# Group the data by participant to calculate time boundaries:
group_by(participant_id) |>
# Summarise the earliest start and latest finish times.
  # Why do I need to do that? Even though in most cases each participant only has
  # one measurement for interaction, some participants have several measurements
  # (for some participants the interaction was interrupted):
summarize(min_start_in = min(start),
max_finish_in = max(finish))
# Similarly, I extract the minimum start and maximum finish times for each participant's
# engagement and store them in a new dataset called `engagement_boundaries`:
engagement_boundaries <- communication_clean |>
# Filter for engagement measurements only:
filter(measurement == "Engagement") |>
# Group the data by participant to calculate time boundaries:
group_by(participant_id) |>
# Summarise the earliest start and latest finish times:
summarise(min_start_en = round(min(start), 3),
max_finish_en = round(max(finish), 3))
# To compare these boundaries, I combine the two datasets into a single one:
boundaries_compare <- interaction_boundaries |>
left_join(engagement_boundaries, by = "participant_id")
# Next, I check for mismatches in the start and finish times for each participant.
# Check if the minimum start times match:
boundaries_compare |>
filter(round(min_start_in, 3) != round(min_start_en, 3)) |>
  nrow()

[1] 0
# Output: 0 participants have mismatched start times.
# Check if the maximum finish times match:
boundaries_compare |>
filter(round(max_finish_in, 3) != round(max_finish_en, 3)) |>
  nrow()

[1] 43
# Output: 43 participants have mismatched finish times.
# Now, I check if the engagement measurements are consistently shorter
# than interaction measurements:
boundaries_compare |>
filter(max_finish_in > max_finish_en) |>
  nrow()

[1] 43
# Output: Engagement measurements are shorter for all 43 participants.
# Finally, I clean up by removing temporary datasets:
rm(list = c("boundaries_compare", "engagement_boundaries", "interaction_boundaries"))

For 43 participants the summed engagement ended earlier than the interaction, suggesting a systematic issue where the latest engagement period did not fully align with the corresponding interaction ending time.
To address this, I adjust the finish times of the last engagement measurement for each participant to align with the corresponding interaction ending time using mutate() and if_else() functions:
communication_clean <- communication_clean |>
# Group by participant
group_by(participant_id) |>
  # Then apply a transformation with the condition that the finish value is the
  # highest within the engagement measurements for a given participant.
  # If the condition is met, replace that finish value with the interaction's finish value.
  # If the condition is not met, leave it as it is.
mutate(
finish = if_else(
measurement == "Engagement" & finish == max(finish[measurement == "Engagement"]),
max(finish[measurement == "Interaction"]),
finish
)
) |>
ungroup() |>
# Recalculate length
  mutate(measurement_length_s = finish - start)

After this, I check if the total number of mismatched values has been reduced:
interaction_engagement_valid <- communication_clean |>
  # I have used this code before, so for a more detailed explanation see above.
filter(measurement != "Backchanneling") |>
group_by(participant_id, measurement) |>
summarise(total_duration = round(sum(measurement_length_s), 3),
.groups = "drop") |>
pivot_wider(names_from = measurement, values_from = total_duration) |>
mutate(duration_valid = Interaction == Engagement)
# Check how many mismatched values remain:
interaction_engagement_valid |>
  filter(!duration_valid)

# A tibble: 5 × 8
participant_id Engagement Interaction `Backchanneling (Laughs)`
<chr> <dbl> <dbl> <dbl>
1 P2 879. 879. 12
2 P46(1) 569. 569. 11
3 P58(1) 454. 454. 8
4 P64(1) 224. 224. 3
5 P67(1) 378. 378. 14
# ℹ 4 more variables: `Backchanneling (Lexical Speech)` <dbl>,
# `Backchanneling (Non-lexical Speech)` <dbl>,
# `Backchanneling (Short Phrases)` <dbl>, duration_valid <lgl>
At this point, there are only 5 mismatched values remaining. Let’s take a closer look at them:
communication_clean |>
filter(participant_id %in% c("P58(1)", "P46(1)", "P2", "P64(1)", "P67(1)") &
         !str_detect(measurement, "Backchanneling"))

# A tibble: 116 × 6
measurement start finish measurement_length_s type participant_id
<fct> <dbl> <dbl> <dbl> <chr> <chr>
1 Interaction 18.0 690. 672. "" P2
2 Interaction 726. 933. 206. "" P2
3 Engagement 18.0 25.2 7.20 "High" P2
4 Engagement 25.2 67.4 42.2 "Low" P2
5 Engagement 67.4 78.4 11.0 "High" P2
6 Engagement 78.4 99.2 20.8 "Low" P2
7 Engagement 99.2 103. 3.80 "High" P2
8 Engagement 103. 122. 19.0 "Low" P2
9 Engagement 122. 143. 20.6 "High" P2
10 Engagement 143. 154. 11.4 "Low" P2
# ℹ 106 more rows
After investigating each case individually, I identify where the issues are and how to resolve them (see Table 3).
Table 3. Summary of remaining issues and corresponding solutions
| Participant | Issue | Solution |
|---|---|---|
| P2 | Two separate interaction measurements. As a result, the finish time of the last engagement period within the first interaction was not adjusted. | Correct the latest engagement within the first interaction period. |
| P46 | Two separate interaction measurements. As a result, the finish time of the last engagement period within the first interaction was not adjusted. | Correct the latest engagement within the first interaction period. |
| P58 | Time gap between two engagement periods. | Remove the gap time from the interaction. |
| P64 | Time gap between two engagement periods. | Remove the gap time from the interaction. |
| P67 | Two separate interaction measurements. As a result, the finish time of the last engagement period within the first interaction was not adjusted. | Correct the latest engagement within the first interaction period. |
For participants with two interaction measurements (P2, P46, and P67), the finish time of the engagement period must align with the end of the first part of the interaction. For participants with time gaps between two engagement periods (P58 and P64), the interaction duration must be adjusted to exclude the gap time, as it cannot be definitively attributed to either engagement period. Since there are only 5 data points that need to be corrected, I do it manually, using data from the tibble above and the mutate() and case_when() functions:
communication_clean <- communication_clean |>
mutate(
finish = case_when(
# Adjust the finish time for the engagement measurement of "P2":
measurement == "Engagement" &
participant_id == "P2" & finish == 690.400 ~ 690.411,
# Adjust for engagement measurement of "P46(1)":
measurement == "Engagement" &
participant_id == "P46(1)" & finish == 378.800 ~ 378.810,
# Adjust for engagement measurement of "P67(1)":
measurement == "Engagement" &
participant_id == "P67(1)" & finish == 165.477 ~ 165.478,
# In all the other cases keep the original value:
TRUE ~ finish
),
) |>
mutate(
# Recalculate duration for all rows
measurement_length_s = finish - start,
measurement_length_s = case_when(
# Override duration for "P58(1)" interaction with a total engagement value:
participant_id == "P58(1)" & measurement == "Interaction" ~ 453.700,
# Override duration for "P64(1)" interaction with a total engagement value:
participant_id == "P64(1)" & measurement == "Interaction" ~ 224.088,
TRUE ~ measurement_length_s
)
  )
Finally (that was a long run), I check whether my efforts were successful and all values now match:
interaction_engagement_valid <- communication_clean |>
  filter(!str_detect(measurement, "Backchanneling")) |>
group_by(participant_id, measurement) |>
summarise(total_duration = round(sum(measurement_length_s), 3), .groups = "drop") |>
pivot_wider(names_from = measurement, values_from = total_duration) |>
mutate(duration_valid = Interaction == Engagement)
# Check if there are any remaining mismatched values:
interaction_engagement_valid |>
filter(!duration_valid) |>
  nrow()
[1] 0
# 0 rows - Yes!
# Clear space and remove `interaction_engagement_valid` dataset
rm("interaction_engagement_valid")
With the last transformation, our second condition has been met and we can move on to the third and final condition. Here everything is much easier. Each individual backchanneling measurement represents a single backchannel response, so its ‘length’ should be equal to 1. Let’s make this happen using the mutate() and if_else() functions:
# Ensure backchanneling duration is always 1 second:
communication_clean <- communication_clean |>
mutate(
measurement_length_s = if_else(
str_detect(measurement, "Backchanneling"),
1,
    # Keep the recalculated duration for other rows
measurement_length_s
)
)
# Let's check if that worked:
communication_clean |>
filter(str_detect(measurement, "Backchanneling") & measurement_length_s != 1) |>
  nrow()
[1] 0
# 0 observations - It worked!
Now that all the conditions have been met, I can confidently say the data is accurate and consistent across fields. The only task left is to remove the unnecessary start and finish columns, leaving only the relevant data for analysis.
communication_clean <- communication_clean |>
# Drop columns no longer needed:
  select(-c(start, finish))
That being done, we can move on to the next variable.
3.3.4 Variable type
The type variable indicates the engagement level for each engagement period, as well as the specific utterance for each instance of backchanneling (e.g., hmmm, yeah, right etc.). The engagement levels (High, Low, or Swap; see Table 4) were defined in relation to the task at the core of each interaction. For further details, see my Honours Project (Section Methods: Dyadic Interactions).
Table 4. Summary of Engagement Levels. For full definitions, see my Honours Project (Section Methods: Appendix Methods).
| Engagement Level | Definition |
|---|---|
| High | Participant frequently engages with their interaction partner, e.g., by asking questions, responding, backchanneling etc. |
| Swap | Participant switches task roles with their interaction partner (i.e., they take on the role of the demonstrator). For further details about the interaction roles, see my Honours Project (Section Methods: Dyadic interactions). |
| Low | Participant rarely or never engages with their interaction partner, e.g., by asking questions, responding, backchanneling etc. |
Since the type variable contains two distinct types of data, I decide to restructure it.
First, I check that the engagement data is complete and clean:
communication_clean |>
# Filter for "Engagement" measurements
filter(measurement == "Engagement") |>
# Identify unique values in the type column
  distinct(type)
# A tibble: 7 × 1
type
<chr>
1 "Low"
2 "High"
3 "Swap"
4 " High"
5 "High "
6 " High"
7 "Low "
The data is complete, with no missing values. However, there are some duplicates, likely caused by leading or trailing whitespace. Let’s remove those extra whitespaces using the trimws() function:
communication_clean <- communication_clean |>
# Remove leading/trailing whitespaces from the `type` column:
mutate(
type = trimws(type),
)
communication_clean |>
filter(measurement == "Engagement") |>
  distinct(type)
# A tibble: 3 × 1
type
<chr>
1 Low
2 High
3 Swap
With the duplicates removed, the data is now accurate and falls within the acceptable range of values. As the data doesn’t follow a specific structure, I can proceed without any further changes.
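As a quick aside, here is a minimal, self-contained sketch (toy values, not the real data) of how stray whitespace creates apparently distinct categories, and how trimws() collapses them:

```r
# Toy engagement labels with stray whitespace (illustrative only):
x <- c("High", " High", "High ")

unique(x)          # three apparently distinct values
unique(trimws(x))  # collapses to a single "High"
```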
Next, I move the engagement level information from the type variable into the measurement variable, making sure each engagement period displays its engagement level in parentheses using mutate(), if_else(), and paste() functions:
communication_clean <- communication_clean |>
# Update the `measurement` column with the information from the `type` column:
mutate(
measurement = if_else(
# If the measurement is "Engagement"
measurement == "Engagement",
# Append the type value within brackets:
paste(measurement, " ", "(", type, ")", sep = ""),
# If not leave as it is:
measurement
)
)
# Check that the change was successful:
unique(communication_clean$measurement)
[1] "Interaction" "Engagement (Low)"
[3] "Backchanneling (Laughs)" "Backchanneling (Non-lexical Speech)"
[5] "Backchanneling (Lexical Speech)" "Backchanneling (Short Phrases)"
[7] "Engagement (High)" "Engagement (Swap)"
# Success!
Finally, I convert the measurement variable back to a factor to ensure it still matches the correct data type:
communication_clean <- communication_clean |>
# Convert `measurement` to a factor:
  mutate(measurement = as.factor(measurement))
Now that the engagement level has been incorporated into the measurement field, I tidy the type column so it only contains backchannel responses, using the case_when() function:
communication_clean <- communication_clean |>
mutate(type = case_when(
!str_detect(measurement, "Backchanneling") ~ NA,
str_detect(measurement, "Backchanneling") & type == "" ~ "Unknown",
TRUE ~ type
  ))
Next, I work through each backchanneling category one by one to check and refine the responses.
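For reference, case_when() evaluates its conditions from top to bottom and uses the TRUE clause as a catch-all; a minimal sketch on toy values:

```r
library(dplyr)

x <- c("a", "b", "c")
y <- case_when(
  x == "a" ~ "first",  # matched first, so later clauses are skipped
  x == "b" ~ "second",
  TRUE     ~ x         # catch-all: keep the original value
)
y  # "first" "second" "c"
```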
Starting with the Laugh group:
communication_clean |>
filter(measurement == "Backchanneling (Laughs)") |>
group_by(type) |>
summarise(count = n()) |>
  arrange(desc(count), .by_group = TRUE)
# A tibble: 1 × 2
type count
<chr> <int>
1 Unknown 273
I see that there are no distinct backchannel responses in this group — all values are blank, and therefore labelled as “Unknown”. I replace these with “Laugh” using if_else():
communication_clean <- communication_clean |>
mutate(type = if_else(
measurement == "Backchanneling (Laughs)",
"Laugh",
type
))
communication_clean |>
filter(measurement == "Backchanneling (Laughs)") |>
  distinct(type)
# A tibble: 1 × 1
type
<chr>
1 Laugh
Next, I check the Short Phrases group:
communication_clean |>
filter(measurement == "Backchanneling (Short Phrases)") |>
group_by(type) |>
summarise(count = n()) |>
  arrange(desc(count), .by_group = TRUE)
# A tibble: 18 × 2
type count
<chr> <int>
1 rep 33
2 I see 12
3 got it 5
4 I am seeing it 2
5 I know 2
6 there you go 2
7 I am with you 1
8 I am with you, kind of 1
9 I can see that 1
10 I get you 1
11 I know what you mean 1
12 I see it 1
13 I see that 1
14 Look at that 1
15 Look what you have done 1
16 look at that 1
17 now I see 1
18 that makes sense 1
This group is quite small, with the highest count being a repetition of the interaction partner’s utterance (“rep”) and the rest being interjections of various short phrases. Thus, I decide to split the data into two categories:
- Short Phrase Repetition
- Short Phrase Interjection
I update these using case_when():
communication_clean <- communication_clean |>
mutate(type = case_when(
measurement == "Backchanneling (Short Phrases)" & type == "rep" ~
"Short Phrase Repetition",
measurement == "Backchanneling (Short Phrases)" & type != "rep" ~
"Short Phrase Interjection",
TRUE ~ type
))
communication_clean |>
filter(measurement == "Backchanneling (Short Phrases)") |>
  distinct(type)
# A tibble: 2 × 1
type
<chr>
1 Short Phrase Interjection
2 Short Phrase Repetition
Then, I move on to the Lexical Speech group:
communication_clean |>
filter(measurement == "Backchanneling (Lexical Speech)") |>
group_by(type) |>
summarise(count = n()) |>
  arrange(desc(count), .by_group = TRUE)
# A tibble: 18 × 2
type count
<chr> <int>
1 okay 233
2 yeah 189
3 right 74
4 yep 20
5 yes 11
6 alright 10
7 no 7
8 fine 2
9 nice 2
10 aye 1
11 aye-yeah 1
12 cool 1
13 definitely 1
14 good 1
15 indeed 1
16 maybe 1
17 really 1
18 sweet 1
I notice several variations of “yes” (yeah, yep, aye, aye-yeah). I group them together and leave everything else as it is:
communication_clean <- communication_clean |>
mutate(type = case_when(
measurement == "Backchanneling (Lexical Speech)" &
type %in% c("yes", "yep", "yeah", "aye", "aye-yeah") ~
"yes (including informal variants)",
TRUE ~ type
))
communication_clean |>
filter(measurement == "Backchanneling (Lexical Speech)") |>
  distinct(type)
# A tibble: 14 × 1
type
<chr>
1 okay
2 yes (including informal variants)
3 right
4 sweet
5 no
6 maybe
7 definitely
8 nice
9 alright
10 cool
11 good
12 fine
13 really
14 indeed
Next up is the Non-lexical Speech group:
communication_clean |>
filter(measurement == "Backchanneling (Non-lexical Speech)") |>
group_by(type) |>
summarise(count = n()) |>
  arrange(desc(count), .by_group = TRUE)
# A tibble: 15 × 2
type count
<chr> <int>
1 mhmm 67
2 oh 60
3 mh 58
4 aha 17
5 ah 16
6 aw 5
7 yay 5
8 Unknown 2
9 aa 2
10 hm 2
11 mmm 2
12 nah 1
13 ouh 1
14 wow 1
15 yohoo 1
Since it’s difficult to reliably distinguish what each sound means (e.g. mhmmm, mh, hm), I group them all under “Non-lexical Vocalisation”:
communication_clean <- communication_clean |>
mutate(type = case_when(
measurement == "Backchanneling (Non-lexical Speech)" ~ "Non-lexical Vocalisation",
TRUE ~ type
))
communication_clean |>
filter(measurement == "Backchanneling (Non-lexical Speech)") |>
  distinct(type)
# A tibble: 1 × 1
type
<chr>
1 Non-lexical Vocalisation
Finally, I check the overall field to make sure there are no outstanding issues:
communication_clean |>
  distinct(type)
# A tibble: 19 × 1
type
<chr>
1 <NA>
2 Laugh
3 Non-lexical Vocalisation
4 okay
5 yes (including informal variants)
6 right
7 Short Phrase Interjection
8 Short Phrase Repetition
9 sweet
10 no
11 maybe
12 definitely
13 nice
14 alright
15 cool
16 good
17 fine
18 really
19 indeed
Everything looks good - the data is relevant, complete, accurate, and is within the acceptable range of values. The last step is to convert the column to a factor and rename it to something clearer, like backchannel_response:
communication_clean <- communication_clean |>
rename(backchannel_response = type) |>
  mutate(backchannel_response = as.factor(backchannel_response))
3.3.5 Variable participant_id
The final variable in this dataset is the participant_id. This variable is essential for my analysis as it will be used as a key to join communication and rapport datasets together.
First, I check what the values look like and if there are any missing values.
# Investigate the field:
unique(communication_clean$participant_id)
 [1] "P10(1)" "P11(1)" "P12(1)" "P13(1)" "P14(1)" "P15(1)" "P16(1)" "P18(1)"
[9] "P19" "P2" "P20" "P21" "P22" "P23" "P24" "P26(1)"
[17] "P27(1)" "P28" "P29(1)" "P3" "P30(1)" "P31(1)" "P32(1)" "P34"
[25] "P35" "P36" "P37" "P38" "P39" "P4(1)" "P40" "P46(1)"
[33] "P47(1)" "P48(1)" "P5(1)" "P50(1)" "P51(1)" "P52(1)" "P53(1)" "P54(1)"
[41] "P55(1)" "P56(1)" "P58(1)" "P59(1)" "P6" "P60(1)" "P61(1)" "P62(1)"
[49] "P63(1)" "P64(1)" "P66(1)" "P67(1)" "P68(1)" "P69(1)" "P7(1)" "P70(1)"
[57] "P71(1)" "P72(1)" "P8(1)"
# Check for missing values:
sum(is.na(communication_clean$participant_id))
[1] 0
There are no missing values, indicating that the data is complete. However, since participant IDs in this table come from file names, they do not follow a simple numeric format. Instead, they appear as “P” followed by the ID number, sometimes with “(1)” at the end. To extract just the numeric part, I use the str_extract() function:
communication_clean <- communication_clean |>
# str_extract function extracts the first match of a pattern
# from a string:
# "\\d+" is a regular expression pattern with "\\d" referring to
# any digit, and "+" meaning one or more of the preceding character.
mutate(participant_id = str_extract(participant_id, "\\d+")) |>
# Finally, I transform the character values into numeric vectors:
  mutate(participant_id = as.numeric(participant_id))
This pulls out the numbers and converts them to a numeric format, ensuring the field is correctly structured and matches the numeric data type.
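A small self-contained sketch (illustrative IDs) of what str_extract() with "\\d+" returns; note that only the first run of digits is matched, so the "(1)" suffix is ignored:

```r
library(stringr)

ids <- c("P10(1)", "P2", "P46(1)")
str_extract(ids, "\\d+")              # "10" "2" "46": first digit run only
as.numeric(str_extract(ids, "\\d+"))  # 10 2 46
```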
Finally, the values in this field should fall between 1 and 72, and each number should be unique, with no duplicates:
communication_clean |>
  filter(participant_id > 72 | participant_id < 1)
# A tibble: 0 × 4
# ℹ 4 variables: measurement <fct>, measurement_length_s <dbl>,
# backchannel_response <fct>, participant_id <dbl>
# 0 values - which means all values fall within the range of 1 to 72.
communication_clean |>
distinct(participant_id) |>
  arrange(participant_id)
# A tibble: 59 × 1
participant_id
<dbl>
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 10
9 11
10 12
# ℹ 49 more rows
# There are 59 unique participants, which matches the number of participants
# included in my research project, and all ID numbers fall within the
# acceptable range.
With that, the data validation and cleaning up for the communication dataset is complete and we can remove the original dataset with the cleaned data stored in communication_clean:
rm(communication)
3.4 Data Aggregation
Now that I have cleaned and validated data from the communication dataset, I can safely transform it into a more useful format for analysis. I start by summarising the measurement_length_s values by participant_id and measurement using the group_by() and summarise() functions. I create a new dataset called communication_long to store the aggregated data.
Since there isn’t enough statistical power to analyse each category of backchanneling individually, I decide to combine all of them into a single group — Backchanneling.
communication_long <- communication_clean |>
mutate(
measurement = if_else(str_detect(measurement, "Backchanneling"),
"Backchanneling",
measurement)
) |>
group_by(participant_id, measurement) |>
  summarise(total_measurement_length_s = sum(measurement_length_s), .groups = "drop")
The data in communication_long is in long format, where each row represents a unique observation type, with multiple rows corresponding to a single participant. To ensure a single row corresponds to each participant, I convert the data to a wider format using the pivot_wider() function. I store the newly formatted data in a dataset called communication_wide:
communication_wide <- communication_long |>
pivot_wider(id_cols = participant_id,
values_from = total_measurement_length_s,
              names_from = measurement)
To check whether my transformation was successful, I look at how many unique participant IDs are in the dataset using the unique() and length() functions. I expect to find 59 unique values:
length(unique(communication_wide$participant_id))
[1] 59
As expected, there are 59 unique participants.
I now have the data in two formats — wide and long — both of which are useful at different stages of analysis and data sharing.
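To illustrate the relationship between the two formats, here is a minimal round-trip sketch on toy data (hypothetical measurement names):

```r
library(tidyr)
library(tibble)

# Toy long-format data: one row per participant-measurement pair
long <- tribble(
  ~participant_id, ~measurement,      ~value,
  1,               "Interaction",     10.5,
  1,               "Backchanneling",   3,
  2,               "Interaction",      8.2,
  2,               "Backchanneling",   5
)

# Long -> wide: one row per participant
wide <- pivot_wider(long, names_from = measurement, values_from = value)

# Wide -> long again
back <- pivot_longer(wide, cols = -participant_id,
                     names_to = "measurement", values_to = "value")
```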
Before moving on to the rapport dataset, I first focus on the communication_wide dataset. Since it stores aggregated data, I use it to complete the following tasks:
- Check for any implicit missing values.
- Remove the “Swap” periods while accounting for the decrease in interaction length and registering that a role swap occurred during the interaction.
- Convert measurement lengths from seconds to minutes, ensuring that the total length of all engagement periods still matches the length of the entire interaction.
- Calculate the proportion of time each participant spent verbally engaged with their interaction partner, and remove engagement data reported in absolute values.
- Calculate the frequency of backchanneling for each participant, and remove backchanneling data reported in absolute values.
Once these steps are complete, I will update the communication_long dataset, transferring the modified data from communication_wide.
Let’s get started.
3.4.1 Working with Wide Format Data
In the wide format, a participant’s repeated measurements are aggregated into a single row, and each type of measurement is in a separate column.
str(communication_wide)
tibble [59 × 6] (S3: tbl_df/tbl/data.frame)
$ participant_id : num [1:59] 2 3 4 5 6 7 8 10 11 12 ...
$ Backchanneling : num [1:59] 97 38 44 25 36 22 33 NA 16 24 ...
$ Engagement (High): num [1:59] 385 159 172 119 192 ...
$ Engagement (Low) : num [1:59] 488.5 151.7 366.2 174.8 25.6 ...
$ Engagement (Swap): num [1:59] 5.18 NA NA 10.99 NA ...
$ Interaction : num [1:59] 879 311 538 305 217 ...
3.4.1.1 Identifying Implicit Missing Values
Since every row-column combination must contain a value, any implicit missing values (e.g., no interaction measurement for a given participant) will become more apparent. In R, missing values are usually represented as NA. Let’s check if there are any NA values in the communication_wide dataset using the is.na() function:
communication_wide |>
filter(if_any(everything(), is.na)) |>
  count()
# A tibble: 1 × 1
n
<int>
1 49
There are 49 participants with at least one NA value in their measurements. Let’s examine these cases more closely:
communication_wide |>
  filter(if_any(everything(), is.na))
# A tibble: 49 × 6
participant_id Backchanneling `Engagement (High)` `Engagement (Low)`
<dbl> <dbl> <dbl> <dbl>
1 3 38 159. 152.
2 4 44 172. 366.
3 6 36 192. 25.6
4 8 33 185. 64.6
5 10 NA NA 96.6
6 12 24 80.6 57.6
7 13 8 19.9 75.5
8 14 6 33.5 140.
9 15 6 62.5 84.4
10 16 3 87.1 21.7
# ℹ 39 more rows
# ℹ 2 more variables: `Engagement (Swap)` <dbl>, Interaction <dbl>
It seems that a large proportion of NA values appear in the Engagement (Swap) column. In this column, NA values should be treated as zeros. I check how many participants still have missing values after excluding this column:
communication_wide |>
select(-`Engagement (Swap)`) |>
  filter(if_any(everything(), is.na))
# A tibble: 4 × 5
participant_id Backchanneling `Engagement (High)` `Engagement (Low)`
<dbl> <dbl> <dbl> <dbl>
1 10 NA NA 96.6
2 27 55 275. NA
3 39 NA 6.98 85.5
4 52 14 122. NA
# ℹ 1 more variable: Interaction <dbl>
Only 4 participants remain with NA values. Similarly, in these four cases, the missing values should instead be treated as zeros. Let’s update them all accordingly:
communication_wide <- communication_wide |>
  mutate(across(everything(), ~replace_na(.x, 0)))
Now, all NA values have been replaced with 0.
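The same replace_na() pattern in a self-contained toy example (illustrative column names):

```r
library(dplyr)
library(tidyr)

df <- tibble::tibble(
  id = c(1, 2, 3),
  a  = c(1, NA, 3),
  b  = c(NA, 2, NA)
)

# Replace NA with 0 in every column:
df_filled <- df |>
  mutate(across(everything(), ~ replace_na(.x, 0)))

sum(is.na(df_filled))  # 0
```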
3.4.1.2 Handling Swap Periods
Since autistic individuals typically show a preference for more restricted patterns of behaviour, I decide to remove all swap periods (when participants switched interaction roles with their partners) from my analysis and focus only on High and Low engagement periods.
Before doing that, I create a new column to indicate whether a participant switched roles during the interaction. The wide-format data is ideal for this since each row corresponds to a single unique participant.
I add a new column to the communication_wide dataset containing a logical vector (TRUE/FALSE). A swap is recorded if the Engagement (Swap) measurement is greater than 0:
communication_wide <- communication_wide |>
  mutate(`Swap Occurred` = `Engagement (Swap)` > 0)
I then adjust the Interaction column using the mutate() function to account for the decrease in interaction length caused by removing Swap periods:
communication_wide <- communication_wide |>
mutate(
Interaction = round(Interaction, 3) - round(`Engagement (Swap)`, 3)
  )
With that done, I safely remove the Engagement (Swap) column from the aggregated dataset:
communication_wide <- communication_wide |>
  select(-`Engagement (Swap)`)
3.4.1.3 Converting Time Units
Since the total length of each measurement for a participant is quite large, I convert the units from seconds to minutes by dividing each value by 60 and rounding to two decimal places.
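The order of rounding matters here: summing values that were rounded individually can drift away from rounding the raw sum. A tiny sketch with illustrative durations:

```r
high_s <- 0.4  # illustrative engagement lengths in seconds
low_s  <- 0.4

round((high_s + low_s) / 60, 2)               # 0.01: round the raw sum
round(high_s / 60, 2) + round(low_s / 60, 2)  # 0.02: sum of rounded parts
```

This is why, in the code below, the interaction length is recomputed as the sum of the already-rounded engagement values, keeping the totals internally consistent.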
To maintain accuracy, I first convert the length of each engagement type and then use these values to update the Interaction column:
communication_wide <- communication_wide |>
mutate(
`Engagement (High)` = round(`Engagement (High)` / 60, 2),
`Engagement (Low)` = round(`Engagement (Low)` / 60, 2),
    # To avoid discrepancies between total engagement length
    # and interaction length due to rounding, I calculate
    # interaction length in minutes by summing the rounded
    # `Engagement (High)` and `Engagement (Low)` values:
Interaction = `Engagement (High)` + `Engagement (Low)`
  )
3.4.1.4 Calculating Proportion of Engagement
For my analysis, I’m interested in how much time each participant spends verbally engaging with their interaction partner, relative to the full length of the interaction. So instead of using absolute values, I need proportions. To do that, I divide each Engagement (High) value by the Interaction value for each participant. I then remove the Engagement (High) and Engagement (Low) columns, as I no longer need the absolute values:
communication_wide <- communication_wide |>
mutate(
`Proportion of Verbal Engagement` = round(`Engagement (High)` / Interaction, 2)
) |>
  select(-`Engagement (High)`, -`Engagement (Low)`)
3.4.1.5 Calculating Frequency of Backchanneling
Similarly, I’m interested in the frequency of backchanneling rather than the absolute number of backchannel responses each participant produced during their interaction. To get this, I divide each Backchanneling value by the Interaction value for each participant, then remove the Backchanneling column:
communication_wide <- communication_wide |>
mutate(
`Rate of Backchanneling` = round(`Backchanneling` / Interaction, 2)
) |>
  select(-`Backchanneling`)
3.4.2 Working with Long Format Data
Now that I have modified the communication_wide dataset, I update communication_long accordingly:
communication_long <- communication_wide |>
pivot_longer(cols = `Interaction`:`Rate of Backchanneling`,
names_to = "measurement",
               values_to = "measurement_length")
Importantly, I include the new Swap Occurred column. Since it contains logical data (TRUE/FALSE), it is automatically converted to numeric format, where 1 = TRUE and 0 = FALSE.
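A minimal sketch (toy row) of the logical-to-numeric coercion that pivot_longer() performs when a logical column is gathered together with numeric ones:

```r
library(tidyr)

df <- tibble::tibble(
  participant_id  = 1,
  Interaction     = 12.5,
  `Swap Occurred` = TRUE
)

long <- pivot_longer(df, cols = -participant_id,
                     names_to = "measurement",
                     values_to = "measurement_length")
long$measurement_length  # 12.5 1: TRUE was coerced to 1
```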
With these steps completed, I’m finished working with aggregated data from the communication_clean dataset and can move on to the rapport dataset.
3.5 Dataset rapport Clean-up
The rapport dataset, provided by my project supervisor, contains several variables for each participant. I start by selecting the variables relevant for my analysis, which are:
- participant_id,
- asc_status,
- research_day_condition, and
- rubiks_succeed_rapport_total.
I store these in a new dataset called rapport_clean:
rapport_clean <- rapport |>
select(participant_id,
asc_status,
research_day_condition,
         rubiks_succeed_rapport_total)
Let’s have a quick look at the dataset’s overall structure:
str(rapport_clean)
tibble [72 × 4] (S3: tbl_df/tbl/data.frame)
$ participant_id : num [1:72] 1 2 3 4 5 6 7 8 9 10 ...
$ asc_status : chr [1:72] "Neurotypical" "Neurotypical" "Neurotypical" "Neurotypical" ...
$ research_day_condition : chr [1:72] "Single_Neurotypical" "Single_Neurotypical" "Single_Neurotypical" "Single_Neurotypical" ...
$ rubiks_succeed_rapport_total: chr [1:72] "449" "485" "325" "470.5" ...
The dataset contains 4 variables and 72 observations. I check if there are any duplicate rows in the dataset:
str(distinct(rapport_clean))
tibble [72 × 4] (S3: tbl_df/tbl/data.frame)
$ participant_id : num [1:72] 1 2 3 4 5 6 7 8 9 10 ...
$ asc_status : chr [1:72] "Neurotypical" "Neurotypical" "Neurotypical" "Neurotypical" ...
$ research_day_condition : chr [1:72] "Single_Neurotypical" "Single_Neurotypical" "Single_Neurotypical" "Single_Neurotypical" ...
$ rubiks_succeed_rapport_total: chr [1:72] "449" "485" "325" "470.5" ...
# The same number of observations.
There are no duplicate rows, so I can start cleaning and validating the data in each field.
3.5.1 Variable ‘participant_id’
The variable participant_id serves as the primary key for this dataset, meaning it uniquely identifies each row. Since each row corresponds to an individual participant, I first verify that this variable has no missing values, follows a numeric structure and falls within the range of 1 to 72 (inclusive):
# Check for missing values:
sum(is.na(rapport_clean$participant_id))
[1] 0
# Check that it is numeric:
class(rapport_clean$participant_id)
[1] "numeric"
# Positive and falls within 1 and 72
rapport_clean |>
  filter(participant_id < 1 | participant_id > 72)
# A tibble: 0 × 4
# ℹ 4 variables: participant_id <dbl>, asc_status <chr>,
# research_day_condition <chr>, rubiks_succeed_rapport_total <chr>
# 0 values - all values fall within the acceptable range
There are no missing values, and the data is complete and accurate. The variable matches the expected numeric data type and falls within the acceptable range. However, there are 72 participants instead of 59. This is because not all of the data is relevant: some participants were excluded from my research for one of the following reasons:
- Their interaction partner was the researcher.
- Their video recording was lost or incomplete.
I will remove these participants later, when I join the rapport_clean and communication_wide datasets together.
3.5.2 Variable asc_status
The variable asc_status indicates a participant’s ASC (autism spectrum condition) status — either non-autistic or autistic.
I begin by checking the contents of this field using the unique() function:
unique(rapport_clean$asc_status)
[1] "Neurotypical" "Autistic"
There are no missing values or duplicates, confirming that the field is accurate and complete, but the labels don’t fully match the desired ones. Specifically, I want to rename “Neurotypical” to “Non-autistic” using the case_when() function:
rapport_clean <- rapport_clean |>
mutate(asc_status = case_when(
asc_status == "Neurotypical" ~ "Non-autistic",
TRUE ~ "Autistic"
  ))
The values now fall within the expected range. I then convert this variable into a factor using as.factor():
rapport_clean <- rapport_clean |>
  # Convert `asc_status` to a factor:
mutate(asc_status = as.factor(asc_status))
# Check the result:
class(rapport_clean$asc_status)
[1] "factor"
levels(rapport_clean$asc_status)
[1] "Autistic"     "Non-autistic"
Perfect — everything looks in order. Since this variable does not follow any specific structure, all data validation criteria are met and I can move on to the next variable.
3.5.3 Variable research_day_condition
The variable research_day_condition indicates the group to which participants belonged (see Table 5).
Table 5. Summary of research groups.
| Group | Group Description |
|---|---|
| Non-autistic Group | Both participant and their interaction partner are non-autistic. |
| Mixed Group | One participant is autistic, and one is non-autistic. The perspective from which backchanneling and engagement data were collected alternated between different pairs — if in one pair this data was collected from the autistic participant while the rapport score was given by the non-autistic partner, then in the next pair, the perspectives were reversed. |
| Autistic Group | Both participant and their interaction partner are autistic. |
To start, I check if the data in this field matches the categories above:
unique(rapport_clean$research_day_condition)
[1] "Single_Neurotypical"         "Mixed_Autistic_Neurotypical"
[3] "Single_Autistic"
The data matches the groups — it is accurate and complete. However, the labels should be renamed. Let’s do this using the case_when() function:
rapport_clean <- rapport_clean |>
mutate(research_day_condition = case_when(
research_day_condition == "Single_Neurotypical" ~ "Matched Non-autistic",
research_day_condition == "Single_Autistic" ~ "Matched Autistic",
research_day_condition == "Mixed_Autistic_Neurotypical" ~ "Mixed",
TRUE ~ "no_group"
  ))
Now, the data falls within the acceptable range of values. Finally, let’s convert it to a factor using the as.factor() function:
rapport_clean <- rapport_clean |>
# Convert `research_day_condition` to a factor:
mutate(research_day_condition = as.factor(research_day_condition))
# Check if that worked
class(rapport_clean$research_day_condition)
[1] "factor"
levels(rapport_clean$research_day_condition)
[1] "Matched Autistic"     "Matched Non-autistic" "Mixed"
# Perfect!
Now that this variable is a factor with 3 levels, it matches the expected data type for this field. The data does not follow any specific pattern, so we are done here and can move on to the final variable.
3.5.4 Variable rubiks_succeed_rapport_total
The rapport scores provided by each participant’s interaction partner are incorrectly matched in the rapport dataset. This happened because participants were arranged in a chain, where each person interacted with two others — one before them and one after. Crucially, all individuals in the chain were participants in the study; no separate “interaction partners” were brought in. Each participant rated their rapport with the next participant in the sequence. However, the dataset currently links these scores back to the participant who gave them, rather than the one who received them. In other words, the rapport values shown reflect how participants felt about someone else — not how someone else felt about them. (See my Honours Project – Methods: Participants for more detail.)
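To make the chain logic concrete, here is a self-contained toy sketch (hypothetical IDs and scores): shifting the ID by one reassigns each score from the participant who gave it to the participant who received it, leaving the first person in the chain without a score:

```r
library(dplyr)

# Each participant rates the NEXT person in the chain:
given <- tibble::tibble(
  participant_id = c(1, 2, 3),
  score_given    = c(10, 20, 30)
)

received <- given |>
  mutate(participant_id = participant_id + 1) |>  # reassign to the recipient
  rename(score_received = score_given)

res <- tibble::tibble(participant_id = c(1, 2, 3)) |>
  left_join(received, by = "participant_id")
res  # participant 1 has NA: nobody in the chain rated them
```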
To correct this, I will remove the old column and replace it with a corrected version. I first increase each participant’s ID by 1 in the original rapport dataset to align the rapport scores with the correct recipient. I will then use the right_join() function to merge this corrected data with the rapport_clean dataset. Finally, I will remove the old, incorrect rubiks_succeed_rapport_total column.
rapport_clean <- rapport |>
# Select key for joining datasets and relevant column:
select(participant_id,
rubiks_succeed_enjoy,
rubiks_succeed_easy,
rubiks_succeed_successful,
rubiks_succeed_understood,
rubiks_succeed_awkward,
rubiks_succeed_friendly,
rubiks_succeed_rapport_total) |>
# Shift each participant ID by +1 to align scores correctly:
mutate(participant_id = participant_id + 1) |>
# Rename the column to avoid confusion:
rename(rapport_score = rubiks_succeed_rapport_total) |>
  # Join with the rapport_clean dataset:
right_join(rapport_clean, join_by(participant_id)) |>
# Remove the old, incorrect column:
  select(-rubiks_succeed_rapport_total)
Now that this correction is complete, I will check whether rapport scores are stored in a numerical format using the str() function.
str(rapport_clean$rapport_score)
 chr [1:72] "449" "485" "325" "470.5" "453.5" "394" "463" "NA" "462" "347" ...
Currently, it is in a string format. I fix this using the as.numeric() function:
rapport_clean <- rapport_clean |>
  mutate(rapport_score = as.numeric(rapport_score))
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `rapport_score = as.numeric(rapport_score)`.
Caused by warning:
! NAs introduced by coercion
# Check if transformation was successful:
glimpse(rapport_clean$rapport_score)
 num [1:72] 449 485 325 470 454 ...
The data now matches the expected type.
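As an aside, a tiny sketch (illustrative strings) of how as.numeric() treats the literal string "NA", which is what produced the coercion warning:

```r
x <- c("449", "NA", "470.5")  # illustrative character values
y <- suppressWarnings(as.numeric(x))
y  # the literal string "NA" becomes a true NA; the rest parse cleanly
```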
During the conversion, a warning appeared indicating that some values were transformed into NA values. This makes sense, as a few participants interacted with the researcher and did not provide a rapport score:
rapport_clean |>
filter(is.na(rapport_score))
# A tibble: 9 × 10
participant_id rubiks_succeed_enjoy rubiks_succeed_easy rubiks_succeed_succe…¹
<dbl> <chr> <chr> <chr>
1 9 NA NA NA
2 17 NA NA NA
3 25 NA NA NA
4 33 NA NA NA
5 41 NA NA NA
6 49 NA NA NA
7 57 NA NA NA
8 65 NA NA NA
9 1 <NA> <NA> <NA>
# ℹ abbreviated name: ¹rubiks_succeed_successful
# ℹ 6 more variables: rubiks_succeed_understood <chr>,
# rubiks_succeed_awkward <chr>, rubiks_succeed_friendly <chr>,
# rapport_score <dbl>, asc_status <fct>, research_day_condition <fct>
These participants were excluded from the research and will be excluded from the analysis when I join the rapport_clean and communication_wide datasets together in the Section: Data Join, so I leave them for now.
Next, I verify that all values are non-negative by filtering with the logical condition rapport_score < 0:
rapport_clean |>
filter(
rapport_score < 0
)
# A tibble: 0 × 10
# ℹ 10 variables: participant_id <dbl>, rubiks_succeed_enjoy <chr>,
# rubiks_succeed_easy <chr>, rubiks_succeed_successful <chr>,
# rubiks_succeed_understood <chr>, rubiks_succeed_awkward <chr>,
# rubiks_succeed_friendly <chr>, rapport_score <dbl>, asc_status <fct>,
# research_day_condition <fct>
There are no negative values, confirming that the data is complete, accurate, and falls within the acceptable range. The data does not need to follow any specific structure, so that criterion does not apply here.
With that, the data validation and cleaning up for the rapport dataset is complete and we can remove the original dataset:
rm(rapport)
3.6 Data Join
Now that I’ve cleaned both the communication and rapport datasets, and reformatted the communication data, I can combine them for the analysis. Using the participant_id column as the key, I join the communication_wide and rapport_clean datasets, storing the combined data in a new dataset called communication_rapport:
communication_rapport <- communication_wide |>
left_join(rapport_clean, join_by(participant_id))
With this step completed, only the data relevant for the analysis is included.
3.6.1 Renaming and Organising Columns
As a final touch before analysis, I rename and reorganise the communication_rapport columns to make them clearer and more representative of the information they hold.
I place the participant ID at the start, followed by the participant’s group. Next, I add the communication-related variables, and finally, the rapport scores given by participants’ interaction partners:
communication_rapport <- communication_rapport |>
rename(backchanneling_freq = `Rate of Backchanneling`,
engagement_prop = `Proportion of Verbal Engagement`,
interaction_min = Interaction,
swap_y_n = `Swap Occurred`,
research_group = research_day_condition) |>
select(participant_id,
asc_status,
research_group,
interaction_min,
engagement_prop,
backchanneling_freq,
swap_y_n,
rapport_score)
With this, I am ready to analyse my data.
4 Analysing Data
At this stage of the data analysis, it’s time to go through my clean data, explore it, and check whether the predictions I made earlier hold up. The analysis phase is at the core of any data analysis project.
In this project, I compare autistic, non-autistic, and mixed pairs across four key variables: interaction length, rapport, engagement, and backchanneling.
I also look into the relationship between engagement and rapport, as well as backchanneling and rapport, and test whether either of these variables can predict the rapport score given by the participant’s interaction partner — both overall and within each group individually.
Additionally, I investigate whether the switch in interaction roles occurs across the groups with the same frequency.
Before diving into the analysis, I want to go through the statistical tests I used to analyse my data.
4.1 Overview of Statistical Methods
Since I applied many of the statistical tests multiple times, it made sense to write them as functions rather than repeating the same code over and over. This way, I could keep things efficient and consistent.
For each statistical test, I’ll walk through:
Why I chose it.
The assumptions behind it and how I checked them.
How to interpret the output of each function used for the test.
4.1.1 Spotting Outliers: Why It Matters
Before I go over the statistical tests I used to analyse my data, I want to briefly focus on outliers — extreme values that can distort results in ways that make the findings harder to trust. This is especially relevant since I use tests like one-way ANOVA and linear regression, which are particularly sensitive to outliers. So, before running any tests, I needed to assess whether my dataset contained any outliers, and how severe they might be.
To identify outliers, I built a function that differentiates between mild and extreme outliers. For context, in my dissertation, I followed Halldestam’s (2016) method of defining them:
Mild outliers are values that fall just outside the usual range (lie beyond ±1.5 times the interquartile range (IQR) but remain within the outer fences of the boxplot).
Extreme outliers are values that sit far beyond what’s expected (outside the outer fences of the boxplot) and can significantly impact the analysis.
These “outer fences” are calculated as:
Lower outer fence = Q1 − (3 × IQR)
Upper outer fence = Q3 + (3 × IQR),
where: Q1 and Q3 are the first and third quartiles, representing the 25th and 75th percentiles of the data; and IQR (interquartile range) is the spread between Q1 and Q3, covering the middle 50% of the data.
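As a quick sketch of the fence arithmetic, here is a minimal base-R example (the vector x contains made-up values for illustration only, not project data):

```r
# Hypothetical values, for illustration only:
x <- c(4.2, 5.1, 5.8, 6.0, 6.3, 7.1, 9.2, 14.6)

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1

# Inner fences: values beyond these are at least mild outliers
c(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Outer fences: values beyond these are extreme outliers
c(q1 - 3 * iqr, q3 + 3 * iqr)
```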
By systematically identifying and classifying outliers, I was able to make informed decisions about whether to transform the data, exclude certain points (not a favourable approach when working with human data, in my opinion), or simply be mindful of their influence on the analysis.
Here’s the function I wrote to identify and classify outliers:
identify_outliers <- function(data, group_var, response_var) {
data |>
group_by({{ group_var }}) |>
mutate(
q1 = quantile({{ response_var }}, 0.25, na.rm = TRUE),
q3 = quantile({{ response_var }}, 0.75, na.rm = TRUE),
iqr = q3 - q1,
mild_threshold_low = q1 - 1.5 * iqr,
mild_threshold_high = q3 + 1.5 * iqr,
extreme_threshold_low = q1 - 3 * iqr,
extreme_threshold_high = q3 + 3 * iqr,
outlier_type = case_when(
{{ response_var }} < extreme_threshold_low |
{{ response_var }} > extreme_threshold_high ~ "Extreme",
{{ response_var }} < mild_threshold_low |
{{ response_var }} > mild_threshold_high ~ "Mild",
TRUE ~ "None"
)
) |>
filter(outlier_type != "None") |>
ungroup() |>
select({{ group_var }}, {{ response_var }}, outlier_type) |>
arrange({{ group_var }})
}
To use this function, you need to specify:
your dataset (data),
the grouping variable - the categorical variable that defines the groups (group_var), and
the response variable - the continuous variable you want to investigate (response_var).
The function returns a subset of the data, containing only the identified outliers and their assigned class (“Mild” or “Extreme”).
4.1.2 One-way ANOVA
A one-way ANOVA (analysis of variance) test is a parametric test designed to compare k number of groups. In my case, I had three groups — autistic, non-autistic, and mixed — so this was the most appropriate statistical test for comparing these groups across my response variables.
4.1.2.1 Checking Assumptions
Since I used one-way ANOVA to analyse most variables in this project, I want to go over the key assumptions that need to be met before using it. For the one-way ANOVA test’s output to be valid and trustworthy, the data needs to meet the following assumptions (in order of importance):
Random Sampling - Each participant in the population should have an equal chance of being selected.
Independence – Each observation should be independent of the others (no pseudoreplication).
Homogeneity of variance – The spread of data across groups should be roughly the same.
Normality – The data within each group should be roughly normally distributed.
The first two assumptions depend on the research design and how the data was collected, so there’s not much you can do about them post hoc. But the last two — homogeneity of variance and normality — can (and should) be checked using the data itself. To do this, I wrote two functions.
4.1.2.1.1 Visual Inspection with Boxplots
The first function creates boxplots for each group. Boxplots are a quick way to spot potential problems — like if one group has a much wider spread than the others or if there’s a strong skew in the data.
plot_boxplots <- function(data, group_var, response_var) {
ggplot(data, aes(x = {{ group_var }}, y = {{ response_var }})) +
# Plot boxplots:
geom_boxplot(outliers = FALSE) +
# Plot individual data points:
geom_jitter(width = 0.1) +
theme_minimal() +
# substitute(): captures the name of the variable passed as
# the function argument
# as.character(): converts the name into a character string
# so it can be displayed as the label
labs(title = "Boxplots of Response Variable by Group",
x = as.character(substitute(group_var)),
y = as.character(substitute(response_var)))
}
To use this function (similarly to the function identify_outliers), you need to provide:
the dataset (data),
the grouping variable (group_var), and
the response variable (response_var).
The function returns a boxplot for each group. You can use them to visually check if the groups have a similar spread and distribution.
4.1.2.1.2 Analysis with Diagnostic Plots
The second function digs deeper into the assumptions of normality and equal variance. It does this by plotting two diagnostic plots for your model:
Residuals vs Fitted Plot – This helps to check if variance is roughly equal across groups. Ideally, you would expect a random scatter of points without any distinct pattern. If you see a fan or bow tie shaped spread, this suggests a violation of homogeneity of variance.
Q-Q Plot – This checks if the residuals follow a normal distribution. If the points follow the diagonal line closely, this suggests that the normality assumption holds. Significant deviations from the line would suggest non-normality.
run_diagnostic_plots <- function(data, group_var, response_var,
log_transformation = FALSE, sqrt_transformation = FALSE) {
# Fit the original model
data.lm <- lm(data[[response_var]] ~ data[[group_var]], data = data)
# Conditionally fit log-transformed model
if (log_transformation) {
data.lm.log <- lm(log(data[[response_var]]) ~ data[[group_var]], data = data)
}
# Conditionally fit sqrt-transformed model
if (sqrt_transformation) {
data.lm.sqrt <- lm(sqrt(data[[response_var]]) ~ data[[group_var]], data = data)
}
# Always plot the original model
par(mfrow = c(1, 2)) # Arrange plots side by side
plot(data.lm, which = c(1, 2), main = "Original Model")
# Conditionally plot log transformation
if (log_transformation) {
par(mfrow = c(1, 2))
plot(data.lm.log, which = c(1, 2), main = "Log-transformed Model")
}
# Conditionally plot sqrt transformation
if (sqrt_transformation) {
par(mfrow = c(1, 2))
plot(data.lm.sqrt, which = c(1, 2), main = "Square Root-transformed Model")
}
}
To use this function, you need to specify:
the dataset (data),
the grouping variable (in quotes: “group_var”),
the response variable (in quotes: “response_var”), and
whether you want to apply log and/or square root transformations (log_transformation = TRUE and/or sqrt_transformation = TRUE).
If the first plot shows a clear pattern (e.g., the spread increases/decreases as the values increase/decrease, or one group has a much larger variance), it suggests a violation of the homogeneity of variance assumption.
If the second plot shows points deviating significantly from the diagonal line, it indicates that the normality assumption may not hold.
If either assumption is violated, there are two main options:
Apply a transformation – Log or square root transformations can sometimes help equalise variance and improve normality (both options are available in the function).
Proceed with caution – If transformations don’t resolve the issue, it’s important to acknowledge the violation and interpret results carefully.
If no data transformation is effective, the p-value obtained from the test may be affected. A violation of the normality assumption can slightly increase or decrease the p-value, though the effect is usually small. However, if the assumption of equal variance is violated, there is a higher risk of incorrectly rejecting the null hypothesis (i.e., an increased likelihood of a Type I error).
By running these functions before performing ANOVA, I was able to check whether my data met the necessary assumptions. If the assumptions weren’t met, I considered transformations or kept the violations in mind when conducting the test and interpreting the results.
4.1.2.2 Conducting the Test
After checking that the assumptions were met (or at least keeping any violations in mind), I ran a one-way ANOVA to compare the differences in my response variable across the three groups (autistic, non-autistic, and mixed). The function I wrote allows for flexibility in running the test with either the original data or transformed versions if needed:
run_one_way_anova <- function(data, group_var, response_var, method) {
# Check if method is valid
if (!method %in% c("original", "log", "sqrt")) {
stop("Invalid method. Choose one of: 'original', 'log', or 'sqrt'.")
}
# Fit the model based on the chosen method
if (method == "original") {
data.lm <- lm(data[[response_var]] ~ data[[group_var]], data = data)
anova(data.lm)
} else if (method == "log") {
data.lm.log <- lm(log(data[[response_var]]) ~ data[[group_var]], data = data)
anova(data.lm.log)
} else if (method == "sqrt") {
data.lm.sqrt <- lm(sqrt(data[[response_var]]) ~ data[[group_var]], data = data)
anova(data.lm.sqrt)
}
}
The function requires 4 inputs:
The dataset (data),
The grouping variable (“group_var”), which defines the categories being compared,
The response variable (“response_var”), which is the variable being analysed,
The transformation method (“method”), which can be: “original”, “log”, or “sqrt”. Depending on the chosen method:
If “original” is selected, it fits a standard linear model (lm) using the raw data and performs an ANOVA on it.
If “log” is selected, it applies a log transformation to the response variable before fitting the model.
If “sqrt” is selected, it applies a square root transformation before fitting the model.
The function returns the ANOVA table, which includes key statistical values such as the F-statistic and p-value. These results help assess the likelihood of a real difference between at least two of the groups. For my approach to interpreting p-values, see p-values Interpretation section.
4.1.2.3 Tukey HSD Post-hoc Test
If there is enough evidence to reject the null hypothesis, I would conduct a Tukey HSD (honestly significant difference) post-hoc analysis to determine which groups differ. The function I wrote allows for flexibility in transformation, ensuring that the test can be applied to original, log-transformed, or square root-transformed data.
run_tukey_hsd <- function(data, group_var, response_var, method) {
# Check if the chosen method is valid
if (!method %in% c("original", "log", "sqrt")) {
stop("Invalid method. Choose from: 'original', 'log', or 'sqrt'.")
}
# Fit the model based on the selected transformation
formula <- as.formula(paste(response_var, "~", group_var))
if (method == "original") {
data.lm <- lm(formula, data = data)
} else if (method == "log") {
data.lm <- lm(as.formula(paste("log(", response_var, ") ~", group_var)), data = data)
} else if (method == "sqrt") {
data.lm <- lm(as.formula(paste("sqrt(", response_var, ") ~", group_var)), data = data)
}
# Compute estimated marginal means (emmeans) with back-transformation
data.lm.emmeans <- emmeans(data.lm, specs = group_var, type = "response")
print(data.lm.emmeans) # Display emmeans results
# Perform pairwise comparisons with back-transformation
pairwise_results <- pairs(data.lm.emmeans)
print(pairwise_results)
confint(pairwise_results)
}
The function requires the same arguments as the run_one_way_anova function. It returns a table of pairwise comparisons between groups, helping to identify where differences exist. The key outputs include:
Estimated Marginal Means (emmeans): These represent the adjusted means for each group, accounting for the chosen transformation (if there was a transformation).
Pairwise Comparisons: A table showing the difference between each pair of groups.
Confidence Intervals: If the confidence interval for a comparison does not include zero, this suggests a meaningful difference between those groups.
p-values: These indicate how likely differences at least this large would be if there were no real difference between the groups.
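For completeness, a call to this function mirrors the ANOVA call. The example below is purely hypothetical — a follow-up to a log-scale ANOVA on interaction length, which (as it turns out later) was not actually warranted for this variable:

```r
# Hypothetical follow-up after a significant log-scale ANOVA:
run_tukey_hsd(communication_rapport, "research_group", "interaction_min", "log")
```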
4.1.3 Permutation Test
A permutation test (also known as a randomization test) is a computer-intensive method used for hypothesis testing when data doesn’t follow a normal distribution. I use this test as an alternative to the Two-Sample t-test or the Mann–Whitney U-test to compare autistic and non-autistic individuals in mixed pairs across engagement and backchanneling variables.
4.1.3.1 Checking Assumptions
One of the main advantages of permutation tests is that they rely on very few assumptions:
Random Sampling - The data must be a random sample from the population.
Similar Shapes of Distributions - The variable being compared must have the same distributional shape across groups (this only matters when comparing means or medians).
The first assumption is purely based on study design and can’t be fixed post hoc. The second assumption is trickier — permutation tests are fairly robust to violations of this, but only when the sample size is large, which isn’t the case in this project. To check whether the second assumption holds, I visualise the data using histograms. I also simulate data from a normal distribution with the same mean and standard deviation for comparison. To do that, I wrote the following function:
check_distribution_in_mixed_pairs <- function(data, response_var) {
# Filter to mixed neurotype pairs only
data_filtered <- data |>
filter(research_group == "Mixed")
# Plot actual distributions for two groups
actual_plot <- ggplot(data_filtered, aes(x = {{ response_var }})) +
geom_histogram(bins = 10, fill = "steelblue", colour = "black", alpha = 0.7) +
facet_wrap(~asc_status, nrow = 2) +
theme_light() +
labs(
title = "Frequency Distribution by ASC status in Mixed Pairs",
x = as.character(substitute(response_var)),
y = "Count"
)
# Get the necessary stats to plot a simulation
data_summary <- data_filtered |>
# Apply functions below to each group (autistic and non-autistic) separately:
group_by(asc_status) |>
summarise(
# Calculate mean value for each group:
mean = mean({{ response_var }}, na.rm = TRUE),
# Calculate variance value (i.e., SD) for each group:
sd = sd({{ response_var }}, na.rm = TRUE),
# Calculate sample size (i.e., n) for each group:
n = n(),
.groups = "drop"
)
# Simulate data from a normal distribution using group stats
# set.seed function sets the starting point in the sequence of pseudorandom numbers:
set.seed(123) # this makes distribution consistent and reproducible
normal_sim <- data_summary |>
# treat each row independently:
rowwise() |>
# Generate n data points for each group accounting for the group's mean and SD:
mutate(
simulated = list(rnorm(n, mean, sd))
) |>
# Expand the list column
unnest_longer(simulated)
# Plot simulated data
sim_plot <- ggplot(normal_sim, aes(x = simulated)) +
geom_histogram(bins = 10, fill = "darkgreen", colour = "black", alpha = 0.7) +
facet_wrap(~asc_status, nrow = 2) +
theme_light() +
labs(
title = "Simulated Normal Distribution by ASC status",
x = as.character(substitute(response_var)),
y = "Count"
)
# Return both plots
plot(actual_plot)
plot(sim_plot)
}
To use this function, you need to provide:
The dataset (data), and
The response variable (response_var).
The function returns four histograms: two for the autistic group and two for the non-autistic group. One set shows the actual distributions from your dataset, while the other shows simulated distributions generated from a normal distribution, using each group’s sample size, mean, and standard deviation.
The simulated histograms help me visualise the kind of variability one might expect if the underlying population distributions were matched in shape. By comparing the actual histograms with the simulated ones, I can judge whether the groups appear to come from similarly shaped distributions — and therefore whether the second assumption of the permutation test is likely to hold.
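For example, checking the engagement variable would look like this (an illustrative call using the dataset and column names defined earlier):

```r
# Compare actual vs simulated distributions of engagement in mixed pairs:
check_distribution_in_mixed_pairs(communication_rapport, engagement_prop)
```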
4.1.3.2 Conducting the Test
Once I’ve checked that the shape of the distributions is reasonably similar between groups, I can go ahead and run the permutation test. I will use the function oneway_test() from the coin package, which applies the Fisher-Pitman permutation test to the data.
It requires only 2 arguments:
The response and grouping variable, specified using the formula syntax (response_var ~ group_var), and
The dataset (data =).
The output yields a test statistic and the corresponding p-value, indicating whether there is evidence of a difference between the groups.
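A minimal sketch of such a call, assuming the communication_rapport dataset and column names from earlier (the mixed_pairs filtering step is my own illustrative addition, not code from the analysis itself):

```r
# Keep only mixed neurotype pairs and drop unused factor levels:
mixed_pairs <- communication_rapport |>
  filter(research_group == "Mixed") |>
  mutate(asc_status = droplevels(asc_status))

# Fisher-Pitman permutation test on engagement by ASC status:
oneway_test(engagement_prop ~ asc_status, data = mixed_pairs)
```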
4.1.4 Simple Linear Regression
Simple linear regression is a method used to predict values of one numerical variable from the values of another numerical variable. In this project, I use it to test whether a participant’s engagement and/or their backchanneling frequency can predict the rapport score given by their interaction partner. I first test this overall and then in each group individually.
Why not ANCOVA?
ANCOVA (analysis of covariance) is a method that combines elements of ANOVA and regression. It’s used to predict values of one numerical variable from both a numerical and a categorical variable. In this case, the categorical variable would be the research group (autistic, non-autistic, and mixed).
While ANCOVA could be used here in theory, I decided not to use it during my dissertation for two reasons. First, I don’t believe I have enough power to run this test reliably given the small group sizes. Second, as you’ll see later, not all groups meet the assumptions required for linear regression, which means ANCOVA wouldn’t be suitable either. So here, I stick with simple linear regression.
Importantly, I’m aware that running simple linear regression multiple times increases the risk of Type I error. However, this was a dissertation project with limited sample sizes to begin with, so all findings should be taken with great caution anyway.
4.1.4.1 Checking Assumptions
Before fitting a regression model, the data should meet 4 key assumptions:
Independence and Random Sampling – Each observation should be independent, and the data should be randomly sampled from the population.
Linearity – The relationship between the explanatory variable (i.e., backchanneling frequency or engagement proportion) and the response variable (rapport score) should be roughly linear.
Homoscedasticity – The residuals should have a similar spread across the entire range of the explanatory variable.
Normality – Residuals should be approximately normally distributed around the regression line.
The first assumption is built into the study design, so we can’t check it post hoc. The remaining three can be checked visually.
To check linearity, I plotted a scatterplot with a fitted line. I wrote the function below to do this:
plot_scatter_with_fitted_line <- function(data, predict_var, response_var,
log_transformation = FALSE, sqrt_transformation = FALSE) {
# Original scale
p_original <- ggplot(data, aes(x = {{ predict_var }}, y = {{ response_var }})) +
geom_point(alpha = 0.7) +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
theme_minimal() +
ylim(0, NA) +
labs(
title = "Scatterplot with Smoothing",
x = as_label(enquo(predict_var)),
y = as_label(enquo(response_var))
)
print(p_original)
# Log-transformed
if (log_transformation) {
p_log <- ggplot(data, aes(x = {{ predict_var }}, y = log({{ response_var }}))) +
geom_point(alpha = 0.7) +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
theme_minimal() +
labs(
title = "Log-transformed Scatterplot with Smoothing",
x = as_label(enquo(predict_var)),
y = paste0("log(", as_label(enquo(response_var)), ")")
)
print(p_log)
}
# Sqrt-transformed
if (sqrt_transformation) {
p_sqrt <- ggplot(data, aes(x = {{ predict_var }}, y = sqrt({{ response_var }}))) +
geom_point(alpha = 0.7) +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
theme_minimal() +
ylim(0, NA) +
labs(
title = "Square Root-transformed Scatterplot with Smoothing",
x = as_label(enquo(predict_var)),
y = paste0("sqrt(", as_label(enquo(response_var)), ")")
)
print(p_sqrt)
}
}
The function requires 4 arguments:
the dataset (data),
the predictor variable (predict_var), which is the variable used to predict the behaviour of the response variable,
the response variable (response_var), and
whether you want to apply log and/or square root transformations (log_transformation = TRUE and/or sqrt_transformation = TRUE).
The function returns a scatterplot with the fitted line, which can help check whether the relationship is linear. If it’s not, applying a transformation (either logarithmic or square root) might help.
To check homoscedasticity and normality, I reused the same diagnostic plots I used for one-way ANOVA, through the run_diagnostic_plots function. The homoscedasticity assumption is equivalent to homogeneity of variance and is checked with the same plot. For an explanation of how this function works and how to interpret its output, see the section One-way ANOVA: Checking Assumptions.
By running these functions, I could check if my data met the necessary assumptions for the regression analysis. If the assumptions weren’t met, I applied transformations or kept the violations in mind when conducting the analysis and interpreting the results.
4.1.4.2 Conducting the Test
Once I confirmed that the assumptions were met (or considered the violations), I used the following function to run the simple linear regression:
run_simple_linear_regression <- function(data, predict_var, response_var, method = "original") {
# Check if method is valid
if (!method %in% c("original", "log", "sqrt")) {
stop("Invalid method. Choose one of: 'original', 'log', or 'sqrt'.")
}
# Fit the model based on the chosen method
if (method == "original") {
model <- lm(data[[response_var]] ~ data[[predict_var]], data = data)
summary(model)
} else if (method == "log") {
model <- lm(log(data[[response_var]]) ~ data[[predict_var]], data = data)
summary(model)
} else if (method == "sqrt") {
model <- lm(sqrt(data[[response_var]]) ~ data[[predict_var]], data = data)
summary(model)
}
}
This function performs a simple linear regression analysis and requires 4 inputs:
The dataset (“data”),
The predictor variable (“predict_var”),
The outcome variable (“response_var”), and
The transformation method (“method”), which can be: “original”, “log”, or “sqrt”. Depending on the chosen method:
If “original” is chosen, the function fits a linear regression model using the raw (untransformed) outcome variable.
If “log” is chosen, it applies a log transformation to the outcome variable before fitting the model.
If “sqrt” is chosen, it applies a square root transformation to the outcome variable before fitting the model.
The function returns the summary output of the fitted regression model. This includes estimates of the model coefficients (intercept and slope), the R-squared, and the p-value. These values help evaluate how well the explanatory variable explains the variation in the response variable. Specifically, the p-value shows how likely a relationship this strong would be if there were truly none, while R-squared tells us how much of the variation in the outcome can be explained by the predictor.
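For example, testing whether engagement predicts rapport across all participants would look like this (an illustrative call using the dataset and column names defined earlier):

```r
# Does verbal engagement predict the partner's rapport score?
run_simple_linear_regression(communication_rapport,
                             "engagement_prop", "rapport_score",
                             method = "original")
```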
4.1.5 Fisher’s Exact Test
Fisher’s exact test is a contingency test used to examine whether two categorical variables are independent. In this case, I’ll analyse whether the relative frequency of role switches differs across the three research groups – autistic, non-autistic, and mixed. Since we’re dealing with three groups, I’ll use the Freeman-Halton extension to apply Fisher’s Exact Test to a 2×3 contingency table.
Why not contingency χ2 test?
The χ2 test (Chi-square test of independence) is the most commonly used method for checking associations between categorical variables. However, it has 5 key assumptions:
Both variables are categorical.
All observations are independent and randomly sampled.
Cells in contingency table are mutually exclusive.
No more than 20% of the cells should have an expected frequency less than five.
No cell should have an expected frequency less than one.
While the first three assumptions are met — with mutually exclusive research groups and role switch occurrence (recorded as FALSE if no switch occurred in each pair within each group, or TRUE if it did) — the last two assumptions need to be tested. To do this, I first build a contingency table:
contingency_table <- table(communication_rapport$swap_y_n,
communication_rapport$research_group)
I then check the expected values for each cell:
chisq_test <- chisq.test(contingency_table)
Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
incorrect
chisq_test$expected
Matched Autistic Matched Non-autistic Mixed
FALSE 13.830508 17.084746 17.084746
TRUE 3.169492 3.915254 3.915254
We can see that, while no cells have an expected frequency below 1, half of the cells have values below 5. This violates the assumption for the χ² test, so instead, I use Fisher’s Exact Test, which does not require this assumption:
rm(chisq_test, contingency_table)
4.1.5.1 Conducting the Test
To conduct Fisher’s Exact Test, I use the fisher.test() function, which takes a contingency table as input and returns an exact p-value. If there’s sufficient evidence of a difference between the groups, I follow up with pairwise_fisher_test() to perform pairwise comparisons. To adjust for multiple comparisons, I apply the Benjamini-Hochberg method by setting p.adjust.method = "fdr".
While this post-hoc analysis does not produce 95% confidence intervals (CIs), it provides p-values indicating how likely the observed differences between each pair of groups are to occur if the null hypothesis were true and no actual differences existed.
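A sketch of the two calls, rebuilding the contingency table from earlier (fisher.test() is from base stats; pairwise_fisher_test() comes from the rstatix package loaded at the start):

```r
contingency_table <- table(communication_rapport$swap_y_n,
                           communication_rapport$research_group)

# Overall test of independence across the 2x3 table:
fisher.test(contingency_table)

# Pairwise follow-up with Benjamini-Hochberg (fdr) correction,
# only if the overall test provides sufficient evidence:
pairwise_fisher_test(contingency_table, p.adjust.method = "fdr")
```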
4.1.6 p-values Interpretation
When interpreting p-values, I followed the guidelines from Ganesh and Cave (2018), who emphasise that p-values should be treated as continuous measures rather than strict thresholds for decision-making. Instead of relying solely on arbitrary cutoffs, I examined p-values in terms of the strength of evidence they provide, considering them alongside effect sizes and 95% CIs. This approach helps to avoid misinterpretation and provides a more comprehensive understanding of the data.
That said, it’s important to acknowledge the ongoing debate around the most meaningful and responsible ways to statistically analyse and present data — and I’m still learning.
Now that I’ve covered all the tests I’ll be using for this project, along with their assumptions and appropriate functions, I can move on to analysing my cleaned data and exploring the research questions I outlined at the start of the project.
4.2 Applying Methods to Research Questions
4.2.1 Does Interaction Length Vary by Group?
I start by testing whether there is a difference in interaction length between the three research groups using one-way ANOVA. Before running the test, I check for outliers and assess whether the data meet the test assumptions. I do this by running three functions: identify_outliers(), plot_boxplots(), and run_diagnostic_plots().
# First Function - identifying outliers:
identify_outliers(communication_rapport, research_group, interaction_min)
# A tibble: 6 × 3
research_group interaction_min outlier_type
<fct> <dbl> <chr>
1 Matched Autistic 9.18 Mild
2 Matched Autistic 9.49 Mild
3 Matched Non-autistic 14.6 Extreme
4 Matched Non-autistic 8.97 Mild
5 Mixed 6.81 Mild
6 Mixed 7.56 Mild
# Second Function - plotting data:
plot_boxplots(communication_rapport, research_group, interaction_min)
Every group has two outliers. All of them are mild, except for one extreme outlier in the non-autistic group. These outliers could confound the results of the test.
The boxplots show that the autistic group has slightly greater variance than the other two groups. However, the data appear roughly normally distributed across all groups (when outliers aren’t considered).
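plot_boxplots() is another custom wrapper (in my project it is ggplot2-based); the same five-number summaries it visualises can be inspected with base R's boxplot(), here on toy data:

```r
# Toy data standing in for communication_rapport (values are invented)
toy <- data.frame(
  research_group  = rep(c("Matched Autistic", "Matched Non-autistic", "Mixed"),
                        each = 5),
  interaction_min = c(4, 5, 6, 7, 14,   3, 4, 5, 6, 7,   4, 5, 5, 6, 8)
)

# plot = FALSE returns the boxplot statistics without drawing anything
bp <- boxplot(interaction_min ~ research_group, data = toy, plot = FALSE)
bp$stats   # five rows (whiskers, hinges, median), one column per group
```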
Given these patterns, applying a data transformation might be a good idea here. To explore this further, I run diagnostic plots with both non-transformed and transformed data:
# Third Function - diagnostic plots:
run_diagnostic_plots(communication_rapport, "research_group", "interaction_min",
log_transformation = TRUE, sqrt_transformation = TRUE)
Both normality and equal variance assumptions look much better when the data are log-transformed. Still, the original data do not deviate drastically from the assumptions, so I decide to proceed with the original data, but, to be cautious, I also run a one-way ANOVA with the log-transformed data to confirm my findings:
run_one_way_anova(communication_rapport, "research_group", "interaction_min", "original")
Analysis of Variance Table
Response: data[[response_var]]
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 15.23 7.6159 1.3158 0.2764
Residuals 56 324.14 5.7882
The one-way ANOVA on the original data yields a high p-value of 0.28, indicating that there is not enough evidence to suggest that interaction length differed between the research groups.
run_one_way_anova(communication_rapport, "research_group", "interaction_min", "log")
Analysis of Variance Table
Response: log(data[[response_var]])
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 1.0452 0.52262 1.7205 0.1883
Residuals 56 17.0106 0.30376
The log-transformed data produce a slightly lower p-value of 0.19. Nonetheless, the p-value is still too large to claim that there is any real difference between the groups.
In conclusion, there is not enough evidence to suggest a difference in interaction length between groups. One possible explanation is that participants’ interactions were task-driven, with interaction length largely determined by task completion rather than group differences. Still, it is interesting that participants took a similar amount of time to complete the task, especially given evidence that information transfer in mixed pairs (autistic–non-autistic) is less efficient compared to matched pairs (Crompton et al., 2020). A potentially useful extension would be to look not only at whether the task was completed, but also at how accurately it was completed across groups. Unfortunately, this remains an open question, as it falls beyond the scope of my analysis and this research project.
4.2.2 Do Participants Report Higher Rapport in Same-Neurotype Pairs?
Next, I analyse whether there is a difference in rapport scores between the groups using one-way ANOVA. Again, let’s first use our three functions to see whether the data meet all the requirements for the test:
identify_outliers(communication_rapport, research_group, rapport_score)
# A tibble: 3 × 3
research_group rapport_score outlier_type
<fct> <dbl> <chr>
1 Matched Autistic 250. Extreme
2 Matched Autistic 347 Mild
3 Matched Non-autistic 274 Mild
plot_boxplots(communication_rapport, research_group, rapport_score)
The autistic and non-autistic groups each have a few outliers, with the autistic group having one extreme outlier. After examining the boxplots, I am concerned that the extreme outlier in the autistic group could skew the results significantly and mask a real difference that seems to exist between the three groups.
run_diagnostic_plots(communication_rapport, "research_group", "rapport_score",
log_transformation = TRUE, sqrt_transformation = TRUE)
The diagnostic plots show that neither transformation pulls the outliers in. The good news is that the original data appear consistent enough with the assumptions of equal variance and normality. I therefore decide to proceed with the original data, but also to double-check the findings by running a one-way ANOVA on the same data excluding the extreme outlier in the autistic group.
run_one_way_anova(communication_rapport, "research_group", "rapport_score", "original")
Analysis of Variance Table
Response: data[[response_var]]
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 29357 14679 5.1813 0.008618 **
Residuals 56 158649 2833
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value from the original data is quite low — 0.009 — suggesting strong evidence for a difference between groups. To investigate this further, I run a Tukey HSD post-hoc test:
run_tukey_hsd(communication_rapport, "research_group", "rapport_score", "original")
 research_group emmean SE df lower.CL upper.CL
Matched Autistic 425 12.9 56 399 451
Matched Non-autistic 432 11.6 56 409 455
Mixed 383 11.6 56 359 406
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
Matched Autistic - (Matched Non-autistic) -6.87 17.4 56 -0.396 0.9175
Matched Autistic - Mixed 42.44 17.4 56 2.444 0.0459
(Matched Non-autistic) - Mixed 49.31 16.4 56 3.002 0.0110
P value adjustment: tukey method for comparing a family of 3 estimates
contrast estimate SE df lower.CL upper.CL
Matched Autistic - (Matched Non-autistic) -6.87 17.4 56 -48.68 34.9
Matched Autistic - Mixed 42.44 17.4 56 0.63 84.2
(Matched Non-autistic) - Mixed 49.31 16.4 56 9.76 88.9
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
The post-hoc test shows no evidence of a difference between autistic and non-autistic groups (p = 0.9175). However, there is strong evidence for a difference between non-autistic and mixed groups (p = 0.0110), with rapport scores differing by 49.31 (SE 16.4; 95% CI: 9.76 to 88.9). It also indicates moderate evidence for a difference between autistic and mixed groups (p = 0.0459), with a difference estimate of 42.44 (SE 17.4; 95% CI: 0.63 to 84.2).
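My run_tukey_hsd() helper wraps emmeans; an equivalent set of pairwise comparisons can be sketched with base R's TukeyHSD() on an aov() fit (toy data below, with simulated group means loosely echoing those above):

```r
set.seed(42)
# Toy data: three groups with different simulated means
toy <- data.frame(
  group = rep(c("Matched Autistic", "Matched Non-autistic", "Mixed"), each = 20),
  score = c(rnorm(20, 425, 50), rnorm(20, 432, 50), rnorm(20, 383, 50))
)

fit <- aov(score ~ group, data = toy)
tk  <- TukeyHSD(fit, conf.level = 0.95)
tk$group   # estimate, lower/upper CL, and Tukey-adjusted p per pair
```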
This is a bit surprising when looking at the boxplots, which show a clearer difference between autistic and mixed groups, and a less pronounced difference between non-autistic and mixed groups. As mentioned earlier, one-way ANOVA is quite sensitive to outliers, so it’s likely that the extreme outlier in the autistic group has confounded the results. For this reason, I’m going to remove the extreme outlier and see how it affects the findings. I’ll temporarily create a new dataset without the outlier - temp_rapport_n_outlier - using the filter() function:
temp_rapport_n_outlier <- communication_rapport |>
# Removing the extreme outlier (the group label must match the factor level "Matched Autistic"):
filter(!(research_group == "Matched Autistic" & rapport_score == 249.5))
As before, I’ll first run a one-way ANOVA and then, if the p-value is still low, a Tukey HSD post-hoc test:
run_one_way_anova(temp_rapport_n_outlier, "research_group", "rapport_score", "original")
Analysis of Variance Table
Response: data[[response_var]]
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 29357 14679 5.1813 0.008618 **
Residuals 56 158649 2833
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The new p-value is even smaller — now at 0.001.
run_tukey_hsd(temp_rapport_n_outlier, "research_group", "rapport_score", "original")
 research_group emmean SE df lower.CL upper.CL
Matched Autistic 425 12.9 56 399 451
Matched Non-autistic 432 11.6 56 409 455
Mixed 383 11.6 56 359 406
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
Matched Autistic - (Matched Non-autistic) -6.87 17.4 56 -0.396 0.9175
Matched Autistic - Mixed 42.44 17.4 56 2.444 0.0459
(Matched Non-autistic) - Mixed 49.31 16.4 56 3.002 0.0110
P value adjustment: tukey method for comparing a family of 3 estimates
contrast estimate SE df lower.CL upper.CL
Matched Autistic - (Matched Non-autistic) -6.87 17.4 56 -48.68 34.9
Matched Autistic - Mixed 42.44 17.4 56 0.63 84.2
(Matched Non-autistic) - Mixed 49.31 16.4 56 9.76 88.9
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
This time, the Tukey HSD test shows strong evidence (p < 0.01) of a difference between autistic and mixed pairs, with a difference of 53.42 (SE 15.9; 95% CI: 15.2 to 91.7). The comparisons between non-autistic and autistic pairs, and between non-autistic and mixed pairs, remain approximately the same.
In conclusion, there is strong evidence that rapport scores are higher in matched pairs compared to mixed pairs. This aligns with previous research showing that autistic and non-autistic individuals tend to experience higher rapport when interacting with someone of the same neurotype. What might explain this difference? Could it be that engagement is higher in matched pairs, or that backchanneling is more frequent? Let’s remove the temporary temp_rapport_n_outlier dataset, which is no longer needed, and explore these questions.
rm(temp_rapport_n_outlier)
4.2.3 Does Verbal Engagement Differ Across Groups?
I first look at the engagement aspect of the communication and check if it varies across groups using one-way ANOVA. As always, I begin by running the identify_outliers() and plot_boxplots() functions, followed by run_diagnostic_plots() to check whether the data meets the assumptions of the test.
identify_outliers(communication_rapport, research_group, engagement_prop)
# A tibble: 0 × 3
# ℹ 3 variables: research_group <fct>, engagement_prop <dbl>,
# outlier_type <chr>
plot_boxplots(communication_rapport, research_group, engagement_prop)
There are no outliers in any of the groups, the variance between the groups looks largely similar, and the values appear roughly normally distributed. Still, I have a slight concern about the autistic group, since its variance looks slightly different from the other two. Let’s investigate further using the run_diagnostic_plots() function:
run_diagnostic_plots(communication_rapport, "research_group", "engagement_prop")
Turns out my concerns were unnecessary — the assumptions of equal variance and normality look good, so I can go ahead and run a one-way ANOVA without any transformations.
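run_diagnostic_plots() is also a custom helper; at its core it presumably fits a linear model and draws the standard lm() diagnostics, something like this sketch (toy data, no transformations):

```r
set.seed(7)
# Toy grouped data standing in for my real variables
toy <- data.frame(
  group = rep(c("A", "B", "C"), each = 10),
  y     = rnorm(30)
)

fit <- lm(y ~ group, data = toy)   # the one-way model underlying the ANOVA
op  <- par(mfrow = c(2, 2))        # 2x2 grid for the diagnostic panels
plot(fit)                          # the standard lm() diagnostic plots
par(op)
```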
run_one_way_anova(communication_rapport, "research_group", "engagement_prop", "original")
Analysis of Variance Table
Response: data[[response_var]]
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 0.6943 0.34717 5.9012 0.004725 **
Residuals 56 3.2945 0.05883
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is low — 0.005 — indicating strong evidence for a difference in engagement across groups.
run_tukey_hsd(communication_rapport, "research_group", "engagement_prop", "original")
 research_group emmean SE df lower.CL upper.CL
Matched Autistic 0.465 0.0588 56 0.347 0.583
Matched Non-autistic 0.592 0.0529 56 0.486 0.698
Mixed 0.335 0.0529 56 0.229 0.441
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
Matched Autistic - (Matched Non-autistic) -0.127 0.0791 56 -1.600 0.2541
Matched Autistic - Mixed 0.131 0.0791 56 1.650 0.2337
(Matched Non-autistic) - Mixed 0.257 0.0749 56 3.435 0.0032
P value adjustment: tukey method for comparing a family of 3 estimates
contrast estimate SE df lower.CL upper.CL
Matched Autistic - (Matched Non-autistic) -0.127 0.0791 56 -0.3171 0.0639
Matched Autistic - Mixed 0.131 0.0791 56 -0.0600 0.3210
(Matched Non-autistic) - Mixed 0.257 0.0749 56 0.0769 0.4374
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
Post-hoc analysis shows strong evidence (p = 0.003) that participants verbally interacted with their conversation partners for a higher proportion of time in non-autistic pairs compared to mixed pairs, with a difference of 0.257 (SE 0.075; 95% CI: 0.077 to 0.437). There was insufficient evidence for a difference between autistic and non-autistic pairs (p = 0.2541), or between autistic and mixed pairs (p = 0.2337).
In conclusion, there is strong evidence that participants engaged a higher proportion of time with their interaction partners in non-autistic pairs compared to mixed pairs. Interestingly, I did not find sufficient evidence for a difference between autistic and mixed pairs. At the same time, I also did not find sufficient evidence that engagement was higher in non-autistic pairs compared to autistic pairs.
Why is that? One explanation could be that autistic individuals engage less with others regardless of their partner’s ASC status. If that’s the case, we would expect autistic individuals’ engagement to be lower than non-autistic individuals’ within mixed pairs. Let’s check with the Fisher-Pitman permutation test. Before running it, I check whether the distribution of engagement is similar across the two groups:
check_distribution_in_mixed_pairs(communication_rapport, engagement_prop)
The distributions look roughly similar. However, the sample size for each group is very small, which means the findings should be taken with a pinch of salt. Nonetheless, let’s run the test:
oneway_test(engagement_prop ~ asc_status,
data = communication_rapport |> filter(research_group == "Mixed"))
Asymptotic Two-Sample Fisher-Pitman Permutation Test
data: engagement_prop by asc_status (Autistic, Non-autistic)
Z = 0.31714, p-value = 0.7511
alternative hypothesis: true mu is not equal to 0
The test yields a p-value of 0.75, indicating there isn’t sufficient evidence for a difference between the two groups.
These results suggest that both autistic and non-autistic individuals engage less with one another during mixed interactions.
Let’s now investigate how engagement impacts the rapport-building both overall and within each group.
4.2.4 Does Higher Verbal Engagement Predict Higher Rapport?
To test whether participant’s verbal engagement affects their partner’s rapport, I use simple linear regression. Before running the analysis, I check that the data met all necessary assumptions using the plot_scatter_with_fitted_line() and run_diagnostic_plots() functions:
plot_scatter_with_fitted_line(communication_rapport, engagement_prop, rapport_score)
The scatterplot shows a roughly linear relationship between participant’s engagement and their interaction partner’s rapport.
run_diagnostic_plots(communication_rapport, "engagement_prop", "rapport_score")
The diagnostic plots show that the data meet the assumptions of homoscedasticity and normality. However, both the scatterplot and the diagnostic plots reveal a few outliers, which could skew the regression results. So, I first run the analysis on the full dataset, and then again with the outliers removed.
run_simple_linear_regression(communication_rapport, "engagement_prop", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-154.385 -38.288 6.964 36.882 100.736
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 383.77 14.66 26.185 <2e-16 ***
data[[predict_var]] 61.96 27.56 2.248 0.0284 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 55.04 on 57 degrees of freedom
Multiple R-squared: 0.08145, Adjusted R-squared: 0.06534
F-statistic: 5.054 on 1 and 57 DF, p-value: 0.02845
Simple linear regression provides moderate evidence (p = 0.0284) that the rapport reported by a participant’s partner increases with the participant’s verbal engagement. However, the adjusted R-squared is only 0.065, with a slope of 61.96 (SE 27.56) and an intercept of 383.77 (SE 14.66).
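The summary() table above reports standard errors but not CIs for the coefficients; with any lm fit, a 95% CI for the slope can be pulled out with confint(). A quick sketch on simulated data shaped like mine (made-up values):

```r
set.seed(1)
# Simulated stand-in for communication_rapport (n = 59, like my sample)
toy <- data.frame(engagement = runif(59))
toy$rapport <- 384 + 62 * toy$engagement + rnorm(59, sd = 55)

fit <- lm(rapport ~ engagement, data = toy)
confint(fit, "engagement", level = 0.95)   # 95% CI for the slope
```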
To confirm the results, I re-ran the test after removing the two most prominent outliers (those with rapport scores below 300). Since removing just two points doesn’t strongly affect assumptions, I proceed with the same model:
run_simple_linear_regression(communication_rapport |> filter(rapport_score > 300),
"engagement_prop", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-88.186 -39.063 2.439 32.303 95.563
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 388.81 12.94 30.05 <2e-16 ***
data[[predict_var]] 62.52 24.32 2.57 0.0129 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47.78 on 55 degrees of freedom
Multiple R-squared: 0.1072, Adjusted R-squared: 0.091
F-statistic: 6.606 on 1 and 55 DF, p-value: 0.0129
This time, the regression returns slightly stronger evidence (p = 0.013), suggesting that the initial result wasn’t driven by outliers.
In conclusion, there is moderate-to-strong evidence that the proportion of time participants verbally engage with their interaction partners affects the rapport scores their partners give. However, the effect size is quite small: on average, a 10-percentage-point increase in verbal engagement is associated with only a 6.2-unit increase in rapport, so engagement explains only a small fraction of rapport-building.
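The effect-size arithmetic here is just slope × change in the predictor:

```r
slope <- 61.96   # slope from the full-data regression output above
delta <- 0.10    # engagement is a proportion, so a 10-point rise = +0.10

slope * delta    # expected rise in rapport: about 6.2 units
```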
Next, I look into whether this pattern holds across the different neurotype groups. Before running the analysis, I check whether the assumptions of linear regression are met for each group individually, starting with the autistic group:
# Autistic Group:
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Matched Autistic"),
engagement_prop, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
For the autistic group, the assumption of linearity appears to be met (more or less) when rapport scores are square-root transformed.
run_diagnostic_plots(communication_rapport |>
filter(research_group == "Matched Autistic"),
"engagement_prop", "rapport_score",
sqrt_transformation = TRUE)
The assumptions of normality and homoscedasticity look good. However, there is a strong outlier that may skew the results, so I decide to analyse the data both with and without this outlier.
Next, I examine the non-autistic and mixed groups:
# Non-autistic Group:
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Matched Non-autistic"),
engagement_prop, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
# Mixed Group:
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Mixed"),
engagement_prop, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
For the non-autistic group, the assumption of linearity is not met, and neither transformation helps, so I leave this group out of the regression analysis. For the mixed group, the relationship becomes linear with a square-root transformation.
run_diagnostic_plots(communication_rapport |>
filter(research_group == "Mixed"),
"engagement_prop", "rapport_score",
sqrt_transformation = TRUE)
However, the assumption of homoscedasticity is violated in the mixed group. Despite this, I proceed with the analysis, keeping this violation in mind.
I run simple linear regression first with the autistic and then with the mixed group.
# Autistic Group (with the outlier)
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Matched Autistic"),
"engagement_prop", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-148.14 -19.36 12.21 23.86 66.70
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 374.99 26.98 13.900 5.66e-10 ***
data[[predict_var]] 107.86 51.95 2.076 0.0555 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 49.4 on 15 degrees of freedom
Multiple R-squared: 0.2232, Adjusted R-squared: 0.1715
F-statistic: 4.311 on 1 and 15 DF, p-value: 0.05547
# Autistic Group (without the outlier)
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Matched Autistic" &
rapport_score > 300),
"engagement_prop", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-71.157 -15.296 7.931 14.249 45.356
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 407.40 16.61 24.533 6.64e-13 ***
data[[predict_var]] 59.75 31.17 1.917 0.0759 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 28.49 on 14 degrees of freedom
Multiple R-squared: 0.2079, Adjusted R-squared: 0.1513
F-statistic: 3.674 on 1 and 14 DF, p-value: 0.0759
# Mixed Group:
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Mixed"),
"engagement_prop", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-71.660 -33.820 6.768 28.099 87.822
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 386.026 17.077 22.605 3.4e-15 ***
data[[predict_var]] -9.823 42.573 -0.231 0.82
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 43.11 on 19 degrees of freedom
Multiple R-squared: 0.002794, Adjusted R-squared: -0.04969
F-statistic: 0.05324 on 1 and 19 DF, p-value: 0.82
For the mixed group, there was not enough evidence (p = 0.82) of a relationship between engagement and rapport. For the autistic group, there was weak evidence when the outlier was included (p = 0.055), and the evidence weakened further once the outlier was removed (p = 0.076).
In conclusion, there is insufficient evidence to suggest that higher verbal engagement is associated with higher rapport scores in any individual group, even though such a pattern was found when the data from all the groups were combined. Importantly, these findings should be interpreted with caution due to the small sample sizes within each group.
4.2.5 Does Role Switching Happen Equally Often Across Groups?
Additionally, I decided it would be interesting to explore whether the frequency of interaction role switching differs between the groups, using Fisher’s Exact Test.
To run the test, I first build a 2x3 contingency table. Each cell represents the crossover between the research group (autistic, non-autistic, or mixed) and whether a role switch occurs (TRUE) or does not occur (FALSE), with each number giving the count of participants for that combination.
contingency_table <- table(communication_rapport$swap_y_n,
communication_rapport$research_group)
contingency_table
Matched Autistic Matched Non-autistic Mixed
FALSE 17 14 17
TRUE 0 7 4
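To make these counts easier to compare, the per-group switch rates can be read off with prop.table() (rebuilding the same table from the counts above):

```r
# The 2x3 table from above, rebuilt explicitly
contingency_table <- matrix(
  c(17, 0,   # Matched Autistic: FALSE, TRUE
    14, 7,   # Matched Non-autistic
    17, 4),  # Mixed
  nrow = 2,
  dimnames = list(swap  = c("FALSE", "TRUE"),
                  group = c("Matched Autistic", "Matched Non-autistic", "Mixed")))

# margin = 2 -> column-wise proportions, i.e. switch rate within each group
round(prop.table(contingency_table, margin = 2)["TRUE", ], 2)
# Matched Autistic 0.00, Matched Non-autistic 0.33, Mixed 0.19
```

So role switching never occurred in autistic pairs, occurred for about a third of participants in non-autistic pairs, and for roughly a fifth of participants in mixed pairs.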
I then apply Fisher’s Exact Test with the Freeman-Halton extension to this table:
fisher.test(contingency_table)
Fisher's Exact Test for Count Data
data: contingency_table
p-value = 0.02559
alternative hypothesis: two.sided
The test shows moderate evidence (p = 0.026) for a difference between the groups. To explore this further, I run post-hoc pairwise comparisons:
pairwise_fisher_test(contingency_table, p.adjust.method = "fdr")
# A tibble: 3 × 6
group1 group2 n p p.adj p.adj.signif
* <chr> <chr> <int> <dbl> <dbl> <chr>
1 Matched Autistic Matched Non-autistic 38 0.0108 0.0324 *
2 Matched Autistic Mixed 38 0.113 0.17 ns
3 Matched Non-autistic Mixed 42 0.484 0.484 ns
The comparison reveals moderate evidence (p = 0.0324) for a difference between the autistic and non-autistic groups, with role switching occurring more frequently in the non-autistic group than in the autistic group (33% vs 0%).
In conclusion, there is some evidence that non-autistic adults are more likely to switch interaction roles when paired with other non-autistic adults compared to autistic pairs. This aligns with expectations, as a preference for repetitive and restricted behaviour is a common autistic trait.
Interestingly, in mixed pairs, both autistic and non-autistic participants show instances of role switching, suggesting that the presence of a non-autistic partner may influence the interaction dynamics for autistic individuals:
communication_rapport |> filter(research_group == "Mixed" & swap_y_n == TRUE)
# A tibble: 4 × 8
participant_id asc_status research_group interaction_min engagement_prop
<dbl> <fct> <fct> <dbl> <dbl>
1 11 Non-autistic Mixed 4.21 0.53
2 34 Autistic Mixed 6.81 0.07
3 61 Non-autistic Mixed 3.59 0.18
4 64 Autistic Mixed 3.64 0.57
# ℹ 3 more variables: backchanneling_freq <dbl>, swap_y_n <lgl>,
# rapport_score <dbl>
4.2.6 Does Backchanneling Differ Across Groups?
The analysis of the backchanneling data follows the same approach as the engagement data. To avoid repeating myself, I’ve chosen not to annotate the code here. If you’d like to review it, the full code is available in the Code Appendix below, though it won’t include a written explanation of my reasoning. For that, you can refer to my Honours Project (Section Results: Rate of Verbal Backchannel Response).
6 Final Note
Thank you for taking the time to explore this project. It was my first time working with R, and although it took a lot of time and energy, it was a genuinely valuable learning journey.
If you have any questions, feedback, or if you’d like support with your own data analysis, feel free to get in touch: 📧 daniilssmirnovs11@gmail.com
Thanks again — and as a final thought, all I can say is: Rrrrr!
7 Code Appendix
7.1 Analysing Data
7.1.1 Analysing Backchanneling
identify_outliers(communication_rapport, research_group, backchanneling_freq)
# A tibble: 0 × 3
# ℹ 3 variables: research_group <fct>, backchanneling_freq <dbl>,
# outlier_type <chr>
plot_boxplots(communication_rapport, research_group, backchanneling_freq)
run_diagnostic_plots(communication_rapport, "research_group", "backchanneling_freq")
run_one_way_anova(communication_rapport, "research_group",
"backchanneling_freq", "original")
Analysis of Variance Table
Response: data[[response_var]]
Df Sum Sq Mean Sq F value Pr(>F)
data[[group_var]] 2 140.59 70.295 8.0656 0.0008352 ***
Residuals 56 488.06 8.715
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
run_tukey_hsd(communication_rapport, "research_group",
"backchanneling_freq", "original")
 research_group emmean SE df lower.CL upper.CL
Matched Autistic 4.37 0.716 56 2.94 5.80
Matched Non-autistic 7.07 0.644 56 5.78 8.37
Mixed 3.56 0.644 56 2.27 4.85
Confidence level used: 0.95
contrast estimate SE df t.ratio p.value
Matched Autistic - (Matched Non-autistic) -2.704 0.963 56 -2.808 0.0185
Matched Autistic - Mixed 0.811 0.963 56 0.842 0.6786
(Matched Non-autistic) - Mixed 3.515 0.911 56 3.858 0.0009
P value adjustment: tukey method for comparing a family of 3 estimates
contrast estimate SE df lower.CL upper.CL
Matched Autistic - (Matched Non-autistic) -2.704 0.963 56 -5.02 -0.385
Matched Autistic - Mixed 0.811 0.963 56 -1.51 3.130
(Matched Non-autistic) - Mixed 3.515 0.911 56 1.32 5.709
Confidence level used: 0.95
Conf-level adjustment: tukey method for comparing a family of 3 estimates
check_distribution_in_mixed_pairs(communication_rapport, backchanneling_freq)
oneway_test(backchanneling_freq ~ asc_status,
data = communication_rapport |> filter(research_group == "Mixed"))
Asymptotic Two-Sample Fisher-Pitman Permutation Test
data: backchanneling_freq by asc_status (Autistic, Non-autistic)
Z = 0.40616, p-value = 0.6846
alternative hypothesis: true mu is not equal to 0
7.1.2 Analysing Backchanneling and Rapport Relationship
plot_scatter_with_fitted_line(communication_rapport, backchanneling_freq, rapport_score)
run_diagnostic_plots(communication_rapport, "backchanneling_freq", "rapport_score")
run_simple_linear_regression(communication_rapport, "backchanneling_freq", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-158.427 -31.179 8.097 39.858 104.070
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 393.606 13.435 29.296 <2e-16 ***
data[[predict_var]] 3.749 2.236 1.677 0.0991 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 56.07 on 57 degrees of freedom
Multiple R-squared: 0.047, Adjusted R-squared: 0.03028
F-statistic: 2.811 on 1 and 57 DF, p-value: 0.0991
run_simple_linear_regression(communication_rapport |>
filter(
rapport_score > 300 &
!(backchanneling_freq < 2.5 &
rapport_score > 470) &
!(backchanneling_freq > 10 &
rapport_score < 350)
),
"backchanneling_freq", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-92.42 -32.79 10.66 31.15 76.82
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.182 11.064 34.815 < 2e-16 ***
data[[predict_var]] 6.566 1.878 3.496 0.000965 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.23 on 53 degrees of freedom
Multiple R-squared: 0.1874, Adjusted R-squared: 0.172
F-statistic: 12.22 on 1 and 53 DF, p-value: 0.0009649
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Matched Autistic"),
backchanneling_freq, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Matched Autistic" &
rapport_score > 300),
backchanneling_freq, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
run_diagnostic_plots(communication_rapport |>
filter(research_group == "Matched Autistic"),
"backchanneling_freq", "rapport_score")
run_diagnostic_plots(communication_rapport |>
filter(research_group == "Matched Autistic" &
rapport_score > 300),
"backchanneling_freq", "rapport_score")
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Matched Non-autistic"),
backchanneling_freq, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
run_diagnostic_plots(communication_rapport |>
filter(research_group == "Matched Non-autistic"),
"backchanneling_freq", "rapport_score")
plot_scatter_with_fitted_line(communication_rapport |>
filter(research_group == "Mixed"),
backchanneling_freq, rapport_score,
log_transformation = TRUE,
sqrt_transformation = TRUE)
# Autistic Group (with the outlier)
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Matched Autistic"),
"backchanneling_freq", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-170.515 5.047 8.332 25.237 49.859
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 384.207 31.515 12.191 3.48e-09 ***
data[[predict_var]] 9.374 6.593 1.422 0.176
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 52.62 on 15 degrees of freedom
Multiple R-squared: 0.1188, Adjusted R-squared: 0.06001
F-statistic: 2.021 on 1 and 15 DF, p-value: 0.1756
# Autistic Group (without the outlier)
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Matched Autistic" &
rapport_score > 300),
"backchanneling_freq", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-80.771 -7.477 1.796 14.225 37.715
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 401.797 16.617 24.179 8.1e-13 ***
data[[predict_var]] 7.800 3.438 2.269 0.0396 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 27.37 on 14 degrees of freedom
Multiple R-squared: 0.2689, Adjusted R-squared: 0.2166
F-statistic: 5.148 on 1 and 14 DF, p-value: 0.03961
# Non-autistic Group:
run_simple_linear_regression(communication_rapport |>
filter(research_group == "Matched Non-autistic"),
"backchanneling_freq", "rapport_score")
Call:
lm(formula = data[[response_var]] ~ data[[predict_var]], data = data)
Residuals:
Min 1Q Median 3Q Max
-160.24 -28.91 19.24 39.97 72.94
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 426.5794 29.9371 14.249 1.35e-11 ***
data[[predict_var]] 0.7729 3.7555 0.206 0.839
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 63.22 on 19 degrees of freedom
Multiple R-squared: 0.002224, Adjusted R-squared: -0.05029
F-statistic: 0.04236 on 1 and 19 DF, p-value: 0.8391